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Abstract.  Consider  a  dataset  of  vector-valued  observations  that  consists  of  a  modest  number  of 
noisy  inliers,  which  are  explained  well  by  a  low-dimensional  subspace,  along  with  a  large  number  of 
outliers,  which  have  no  linear  structure.  This  work  describes  a  convex  optimization  problem,  called 
reaper,  that  can  reliably  fit  a  low-dimensional  model  to  this  type  of  data.  The  paper  provides 
an  efficient  algorithm  for  solving  the  reaper  problem,  and  it  documents  numerical  experiments 
which  confirm  that  reaper  can  dependably  find  linear  structure  in  synthetic  and  natural  data.  In 
addition,  when  the  inliers  are  contained  in  a  low-dimensional  subspace,  there  is  a  rigorous  theory 
that  describes  when  reaper  can  recover  the  subspace  exactly. 


1.  Introduction 

Low-dimensional  linear  models  have  applications  in  a  huge  array  of  data  analysis  problems.  Let 
us  highlight  some  examples  from  computer  vision,  machine  learning,  and  bioinformatics. 

Illumination  models:  Images  of  a  face — or  any  Lambertian  object — viewed  under  different  illu¬ 
mination  conditions  are  contained  in  a  nine-dimensional  subspace  [EHY95,  HYL+03,  BJ03]. 

Structure  from  motion:  Feature  points  on  a  moving  rigid  body  lie  on  an  affine  space  of  dimen¬ 
sion  three,  assuming  the  affine  camera  model  [CK98].  More  generally,  estimating  structure 
from  motion  involves  estimating  low-rank  matrices  [EvdHIO,  Sec.  5.2]. 

Latent  semantic  indexing:  We  can  describe  a  large  corpus  of  documents  that  concern  a  small 
number  of  topics  using  a  low-dimensional  linear  model  [DDL+88]. 

Population  stratification:  Low-dimensional  models  of  single  nucleotide  polymorphism  (SNP) 
data  have  been  used  to  show  that  the  genotype  of  an  individual  is  correlated  with  her 
geographical  ancestry  [NJB+08].  More  generally,  linear  models  are  used  to  assess  differences 
in  allele  frequencies  among  populations  [PPP+06]. 

In  most  of  these  applications,  the  datasets  are  noisy,  and  they  contain  a  substantial  number  of 
outliers.  Principal  component  analysis,  the  standard  method  for  finding  a  low-dimensional  linear 
model,  is  sensitive  to  these  non-idealities.  As  a  consequence,  good  robust  modeling  techniques 
would  be  welcome  in  a  range  of  scientific  and  engineering  disciplines. 

The  purpose  of  this  work  is  to  introduce  a  new  method  for  fitting  low-dimensional  linear  models. 
The  approach  is  robust  against  noise  in  the  inliers,  and  it  can  cope  with  a  huge  number  of  outliers. 
Our  formulation  involves  a  convex  optimization  problem,  and  we  describe  an  efficient  numerical 
algorithm  that  is  guaranteed  to  produce  an  approximate  solution  after  a  modest  number  of  spectral 
calculations.  We  include  experiments  with  synthetic  and  natural  data  to  verify  that  our  technique 
reliably  seeks  out  linear  structure.  Furthermore,  under  the  ideal  assumption  that  the  inliers  are 
contained  in  a  subspace,  we  develop  sufficient  conditions  for  the  method  to  identify  the  subspace 
without  error. 
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1.1.  Notation  and  Preliminaries.  We  write  ||-||  for  the  £2  norm  on  vectors  and  the  spectral 
norm  on  matrices;  ||-||F  represents  the  Frobenius  norm.  Angle  brackets  (•,  •)  denote  the  standard 
inner  product  on  vectors  and  matrices,  and  tr  refers  to  the  trace.  The  orthogonal  complement 
of  a  subspace  L  is  expressed  as  TV  The  curly  inequality  denotes  the  semidehnite  order:  For 
symmetric  matrices  A  and  B,  we  write  A  =4  B  if  and  only  if  B  —  A  is  positive  semidehnite. 

An  orthoprojector  is  a  symmetric  matrix  II  that  satisfies  II2  =  II.  Each  subspace  L  in  is  the 
range  of  a  unique  Dx  D  orthoprojector  11^.  The  trace  of  an  orthoprojector  equals  the  dimension  of 
its  range:  tr(IIi)  =  dim(L).  For  each  point  x  G  the  image  II lx  is  the  best  £ 2  approximation 
of  x  in  the  subspace  L. 

Finally,  we  introduce  the  spherization  transform  for  vectors: 

{*/ M'  (1.1) 

(0,  otherwise. 

We  extend  the  spherization  transform  to  matrices  by  applying  it  separately  to  each  column. 

1.2.  Linear  Modeling  by  Principal  Component  Analysis.  Let  A  be  a  set  of  N  distinct 
observations  in  WD .  Suppose  we  wish  to  determine  a  d-dimensional  subspace  that  best  explains  the 
data.  For  each  point,  we  can  measure  the  residual  error  in  the  approximation  by  computing  the 
orthogonal  distance  from  the  point  to  the  subspace.  The  classical  method  for  fitting  a  subspace 
asks  us  to  minimize  the  sum  of  the  squared  residuals: 

minimize  ^  \\x  —  Ha; 1 1 2  subject  to  II  is  an  orthoprojector  and  trll  =  d.  (1-2) 

x&X 

This  approach  is  equivalent  with  the  method  of  principal  component  analysis  (PCA)  from  the 
statistics  literature  [Jol02]  and  the  total  least  squares  (TLS)  method  from  the  linear  algebra  com¬ 
munity  [vHV87]. 

The  mathematical  program  (1.2)  is  not  convex  because  orthoprojectors  do  not  form  a  convex 
set,  so  we  have  no  right  to  expect  that  the  problem  is  tractable.  Nevertheless,  we  can  compute  an 
analytic  solution  using  a  singular  value  decomposition  (SVD)  of  the  data  [EY39,  vHV87].  Suppose 
that  X  is  a  D  x  N  matrix  whose  columns  are  the  data  points,  arranged  in  fixed  order,  and  let 
X  =  UYIV1  be  an  SVD  of  this  matrix.  Define  U d  by  extracting  the  first  d  columns  of  U ;  the 
columns  of  U&  are  often  called  the  principal  components  of  the  data.  Then  we  can  construct  an 
optimal  point  II*  for  (1.2)  using  the  formula  II*  =  U^U^. 

1.3.  Classical  Methods  for  Achieving  Robustness.  Imagine  now  that  the  dataset  X  contains 
inkers,  points  we  hope  to  explain  with  a  linear  model,  as  well  as  outliers ,  points  that  come  from 
another  process,  such  as  a  different  population  or  noise.  The  data  are  not  labeled,  so  it  may  be 
challenging  to  distinguish  inkers  from  outliers.  If  we  apply  the  PCA  formulation  (1.2)  to  fit  a 
subspace  to  X ,  the  rogue  points  can  interfere  with  the  linear  model  for  the  inkers. 

To  guard  the  subspace  estimation  procedure  against  outliers,  statisticians  have  proposed  to 
replace  the  sum  of  squares  in  (1.2)  with  a  figure  of  merit  that  is  less  sensitive  to  outliers.  One 
possibility  is  to  sum  the  unsquared  residuals,  which  reduces  the  contribution  from  large  residuals 
that  may  result  from  aberrant  data  points.  This  idea  leads  to  the  following  optimization  problem. 

minimize  ^  ||a5  —  IIa?||  subject  to  II  is  an  orthoprojector  and  trn  =  d.  (1-3) 

xex 

In  case  d  =  D  —  1,  the  problem  (1.3)  is  sometimes  called  orthogonal  l\  regression  [SW87]  or 
least  orthogonal  absolute  deviations  [Nyq88].  The  extension  to  general  d  is  apparently  more  re¬ 
cent  [WatOl,  DZHZ06].  See  the  books  [HR09,  RL87,  MMY06]  for  an  extensive  discussion  of  other 
ways  to  combine  residuals  to  obtain  robust  estimators. 

Unfortunately,  the  mathematical  program  (1.3)  is  not  convex,  and,  in  contrast  to  (1.2),  no  deus 
ex  machina  emerges  to  make  the  problem  tractable.  Although  there  are  many  algorithms  [Nyq88, 
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Algorithm  1.1  Prototype  algorithm  for  robust  computation  of  a  linear  model 
Input: 

•  A  set  X  of  observations  in  WD 

•  The  target  dimension  d  for  the  linear  model,  where  0  <  d  <  D 
Output: 

•  A  [d]  -dimensional  subspace  of  M22 
Procedure: 

1  [Opt.]  Solve  (1.5)  to  obtain  a  center  c*,  and  center  the  data:  x  <—  *  —  c*  for  each  x  £  X 

2  [Opt.]  Spherize  the  data:  x  x/  ||cc||  for  each  nonzero  x  £  X 

3  Solve  (1.4)  with  dataset  X  and  parameter  d  to  obtain  an  optimal  point  P* 

4  Output  a  dominant  [ d\  -dimensional  invariant  subspace  of  _P* 


CM91,  BH93,  Wat02,  DZHZ06,  ZSL09]  that  attempt  (1.3),  none  is  guaranteed  to  return  a  global 
minimum.  In  fact,  most  of  the  classical  proposals  for  robust  linear  modeling  involve  intractable 
optimization  problems,  which  makes  them  poor  options  for  computation  in  spite  of  their  theoretical 
properties  [MMY06]. 

1.4.  A  Convex  Program  for  Robust  Linear  Modeling.  The  goal  of  this  paper  is  to  develop, 
analyze,  and  test  a  rigorous  method  for  fitting  robust  linear  models  by  means  of  convex  optimization. 
We  propose  to  relax  the  hard  optimization  problem  (1.3)  by  replacing  the  nonconvex  constraint  set 
with  a  larger  convex  set.  The  advantage  of  this  approach  is  that  we  can  solve  the  resulting  convex 
program  completely  using  a  variety  of  efficient  algorithms. 

The  idea  behind  our  relaxation  is  straightforward.  Each  eigenvalue  of  an  orthoprojector  II  equals 
zero  or  one  because  II2  =  II.  Although  a  0-1  constraint  on  eigenvalues  is  hard  to  enforce,  the 
symmetric  matrices  whose  eigenvalues  lie  in  the  interval  [0, 1]  form  a  convex  set.  This  observation 
leads  us  to  frame  the  following  convex  optimization  problem.  Given  a  dataset  X  in  and  a  target 
dimension  d  6  (0,  D )  for  the  linear  model,  we  solve 

minimize  ||a  —  Px\\  subject  to  0  P  ^  I  and  tr  P  =  d.  (1-4) 

x&X 

We  refer  to  (1.4)  as  reaper  because  it  harvests  linear  structure  from  data.  The  formulation  (1.4) 
refines  a  proposal  from  the  paper  [ZL11],  which  describes  a  looser  relaxation  of  (1.3). 

In  many  cases,  the  optimal  set  of  (1.4)  consists  of  a  single  orthoprojector  whose  range  provides 
an  excellent  fit  for  the  data.  The  bulk  of  this  paper  provides  theoretical  and  empirical  support  for 
this  observation.  In  Section  4,  we  present  a  numerical  algorithm  for  solving  (1.4). 

1.4.1.  Some  Practical  Matters.  Although  the  reaper  formulation  is  effective  on  its  own,  we  can 
usually  obtain  better  linear  models  if  we  preprocess  the  data  before  solving  (1.4).  Let  us  summarize 
the  recommended  procedure,  which  appears  as  Algorithm  1.1. 

First,  the  reaper  problem  assumes  that  the  inkers  are  approximately  centered.  When  they  are 
not,  it  is  important  to  identify  a  centering  point  c*  for  the  dataset  and  to  work  with  the  centered 
observations.  We  can  compute  a  centering  point  c*  robustly  by  solving  the  Euclidean  median 
problem  [HR09,  MMY06]: 

minimize  ^  j|cc  —  c||  .  (1-5) 

It  is  also  possible  to  incorporate  centering  by  modifying  the  optimization  problem  (1.4);  see  Sec¬ 
tion  6.1.5  for  more  information. 

Second,  the  reaper  formulation  can  be  sensitive  to  outliers  with  large  magnitude.  A  simple 
but  powerful  method  for  addressing  this  challenge  is  to  spherize  the  data  points  before  solving  the 
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optimization  problem.  For  future  reference,  we  write  down  the  resulting  convex  program. 

minimize  \\x  —  Px\\  subject  to  0  P  ^  I  and  tr  P  =  d.  (1-6) 

The  tilde  denotes  the  spherization  transform  (1.1).  We  refer  to  (1.6)  as  the  S-REAPER  problem.  In 
most  (but  not  all)  of  our  experimental  work,  we  have  found  that  S-REAPER  outperforms  reaper. 
The  idea  of  spherizing  data  before  fitting  a  subspace  was  proposed  in  the  paper  [LMS+99],  where  it 
is  called  spherical  PC  A.  Spherical  PC  A  is  regarded  as  one  of  the  most  effective  current  techniques 
for  robust  linear  modeling  [MMY06]. 

Finally,  we  regard  the  parameter  d  in  reaper  and  S-REAPER  as  a  proxy  for  the  dimension  of 
the  linear  model.  While  the  rank  of  an  optimal  solution  P*  to  (1.4)  or  (1.6)  cannot  be  smaller 
than  d  because  of  the  constraints  tr  P  =  d  and  P  =4  1,  the  rank  of  P*  often  exceeds  d.  Our 
numerical  experience  indicates  that  the  dominant  eigenvectors  of  P*  do  not  change  much  as  we 
adjust  d.  Therefore,  if  a  model  with  exact  dimension  d  is  required,  we  recommend  using  a  dominant 
d-dimensional  invariant  subspace  of  P*.  See  Section  6.1.6  for  alternative  approaches. 
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Figure  1.1.  A  needle  in  a  haystack.  The  needle  dataset  is  described  in  Section  1.5. 
[Top]  The  projection  onto  a  random  two-dimensional  subspace.  [Bottom]  The  pro¬ 
jection  onto  the  oracle  two-dimensional  subspace  obtained  by  applying  PCA  to  the 
inliers  only.  The  blue  squares  mark  the  projection  of  the  inkers  onto  the  one¬ 
dimensional  model  computed  with  S-REAPER  (1.6). 
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1.5.  Finding  a  Needle  in  a  Haystack.  The  reaper  and  S-REAPER  formulations  work  aston¬ 
ishingly  well.  They  can  reliably  identify  linear  structure  in  situations  where  the  task  might  seem 
impossible.  As  a  first  illustration,  we  examine  the  problem  of  “finding  a  needle  in  a  haystack.”  In 
Section  5,  we  present  a  more  comprehensive  suite  of  experiments. 

Let  the  ambient  dimension  D  =  100.  Consider  a  dataset  X  in  that  consists  of  6  points  arrayed 
arbitrarily  on  a  line  through  the  origin  (the  needle )  and  200  points  drawn  from  a  centered  normal 
distribution  with  covariance  D~1I  (the  haystack).  The  6  inkers  have  strong  linear  structure,  but 
they  comprise  only  3%  of  the  data.  The  200  outliers,  making  up  the  bulk  of  the  sample,  typically 
have  no  linear  structure  at  all.  We  solve  the  optimization  problem  S-REAPER  with  the  parameter 
d  =  1  to  obtain  an  optimal  point  P*,  which  happens  to  be  a  rank-one  orthoprojector. 

Figure  1.1  displays  the  outcome  of  this  experiment.  We  see  that  the  range  of  the  S-REAPER 
model  P*  identifies  the  direction  of  the  needle  with  negligible  errorl  Of  course,  this  feat  is  not 
magic.  Even  though  the  needle  is  impossible  to  see  in  most  two-dimensional  projections  of  the 
data,  it  becomes  visible  once  we  identify  the  direction  of  the  needle.  We  find  it  remarkable  that 
S-REAPER  renders  the  search  for  this  direction  into  a  convex  problem. 
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Figure  1.2.  A  syringe  in  a  haystack.  The  syringe  dataset  is  described  in  Sec¬ 
tion  1.5.  [Top]  The  projection  onto  a  random  two-dimensional  subspace.  [Bottom] 
The  projection  onto  the  oracle  two-dimensional  model  determined  by  applying  PCA 
to  the  inliers  only.  The  blue  squares  mark  the  projection  of  the  inkers  onto  the  one¬ 
dimensional  model  computed  with  S-REAPER  (1.6). 
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Our  method  for  fitting  linear  models  is  robust  to  noise  in  the  inliers,  although  it  may  require  more 
data  to  fit  the  model  accurately.  To  illustrate,  we  consider  a  noisy  version  of  the  same  problem. 
We  retain  the  ambient  dimension  D  =  100.  Each  of  10  inliers  takes  the  form  Xi  =  g^v  +  z?; 
where  v  is  a  fixed  vector,  g j  are  i.i.d.  standard  normal  variables,  and  the  noise  vectors  Zj  are 
i.i.d.  normal(0,  (16.D)-1I).  The  inliers  form  a  tube  around  the  vector  v ,  so  they  look  like  a 
syringe.  As  before,  the  outliers  consist  of  200  points  that  are  i.i.d.  normal(0,  .D_1I).  We  apply 
S-REAPER  with  d  =  1  to  the  full  dataset  to  obtain  a  solution  P*,  and  we  use  a  dominant  eigenvector 
u±  of  P*  as  our  one-dimensional  model.  To  find  the  optimal  two-dimensional  model  for  the  inliers, 
we  solve  (1.2)  with  d  =  2  using  the  inliers  only  to  obtain  two  oracle  principal  components.  Figure  1.2 
illustrates  the  outcome  of  this  experiment.  The  angle  between  the  first  oracle  principal  component 
and  the  S-REAPER  model  is  approximately  3.60°. 

1.6.  The  Haystack  Model.  Inspired  by  the  success  of  these  experiments,  let  us  attempt  to  place 
them  on  a  firm  theoretical  foundation.  To  that  end,  we  propose  a  very  simple  generative  random 
model.  We  want  to  capture  the  idea  that  the  inliers  admit  a  low-dimensional  linear  model,  while 
the  outliers  are  totally  unstructured. 

Table  1.1.  The  Haystack  Model.  A  generative  ran¬ 
dom  model  for  data  with  linear  structure  that  is  con¬ 
taminated  with  outliers. 


D 

Dimension  of  the  ambient  space 

L 

A  proper  d-dimensional  subspace  of  containing  the  inliers 

Ain 

Number  of  inliers 

AOUt 

Number  of  outliers 

Pin 

Inlier  sampling  ratio  pm  :=  N\n/d 

Pont 

Outlier  sampling  ratio  pout  :=  Nout/D 

°in 

Variance  of  the  inliers 

CTout 

Variance  of  the  outliers 

Ain 

Set  of  Ar;n  inliers,  drawn  i.i.d.  normal(0,  (ofn/d)  n^) 

S^out 

Set  of  Aout  outliers,  drawn  i.i.d.  normal(0,  (<r2ut/D)  I/)) 

X 

The  set  Xm  U  Aout  containing  all  the  data  points 

There  are  a  few  useful  intuitions  associated  with  this  model.  As  the  inlier  sampling  ratio  pia 
increases,  the  inliers  fill  out  the  subspace  L  more  completely  so  the  linear  structure  becomes  more 
evident.  As  the  outlier  sampling  ratio  pQ ut  increases,  the  outliers  become  more  distracting  and  they 
may  even  start  to  exhibit  some  linear  structure  due  to  chance.  Next,  observe  that  we  have  scaled 
the  points  so  that  their  energy  is  independent  of  the  dimensional  parameters: 

E  ||a:||2  =  ofn  for  x  G  Ajn  and  E  ||a:||2  =  cr2ut  for  x  £  Xout. 

As  a  result,  when  ofn  =  ofmt ,  we  cannot  screen  outliers  just  by  looking  at  their  energy.  The  sampling 
ratios  and  the  variances  contain  most  of  the  information  about  the  behavior  of  this  model. 

1.7.  Exact  Recovery  under  the  Haystack  Model.  We  have  been  able  to  establish  conditions 
under  which  reaper  and  S-REAPER  provably  find  the  low-dimensional  subspace  L  in  the  Haystack 
Model  with  high  probability.  A  sufficient  condition  for  exact  recovery  is  that  pm ,  the  number  of 
inliers  per  subspace  dimension,  should  increase  linearly  with  pout,  the  number  of  outliers  per  ambient 
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dimension.  As  a  consequence,  we  can  find  a  low-dimensional  linear  structure  in  a  high-dimensional 
space  given  a  tiny  number  of  examples,  even  when  the  number  of  outliers  seems  exorbitant. 

Theorem  1.1  (Exact  Recovery  for  the  Haystack  Model).  Fix  a  number  (3  >  0,  and  assume  that 
1  <  d  <  (D  —  l)/2.  Let  L  be  an  arbitrary  d-dimensional  subspace  o/MD,  and  draw  the  dataset  X 
at  random  according  to  the  Haystack  Model  on  page  6. 

(i)  Suppose  that  the  sampling  ratios  and  the  variances  satisfy  the  relation 

Pin  >  Ci  +  C2  /3  +  C3  •  - 1  •  (/3out  +  1  +  4 (3) .  (1-7) 

°m 

Then  11^  is  the  unique  solution  to  the  reaper  problem  (1.4)  except  with  probability  4e 

(ii)  Suppose  that  the  sampling  ratios  satisfy  the  relation 

Pin  >  Ci  +  C2  (3  +  C3  Pout-  (1-8) 

Then  II l  is  the  unique  solution  to  the  S-REAPER  problem  (1.6)  except  with  probability  4  e 
The  numbers  C *  and  C i  are  positive  universal  constants. 

Theorem  1.1  is  a  consequence  of  a  deterministic  exact  recovery  condition  that  we  discuss  more 
fully  in  Section  3.  We  obtain  Theorem  1.1  when  we  compute  the  values  of  some  summary  statistics, 
assuming  that  the  data  is  drawn  from  the  Haystack  Model.  See  Appendix  B  for  the  details  of  this 
argument,  including  tighter  conditions  that  hold  for  the  full  parameter  range  1  <  d  <  D  —  1. 

For  clarity,  we  have  suppressed  the  values  of  the  constants  in  (1.7)  and  (1.8).  Our  proof  yields 
reasonable  numerical  estimates.  For  example,  Ci  <  13  and  C2  <  7  and  C3  <  16.  Experiments 
indicate  that  C3,  the  constant  of  proportionality  between  p-m  and  pout)  is  less  than  two. 

1.7.1.  A  Needle  in  a  Haystack.  To  complement  the  experiments  in  Section  1.5,  we  include  a  quan¬ 
titative  result  on  the  behavior  of  S-REAPER  for  finding  a  needle  in  a  haystack. 

Fact  1.2.  Assume  the  ambient  dimension  D  >  50,  and  fix  a  one- dimensional  subspace  L.  Suppose 
the  number  of  inliers  N{ n  >13  and  the  number  of  outliers  IVout  =  2D.  Draw  a  dataset  X  according 
to  the  Haystack  Model  on  page  6.  Then  11^  is  the  unique  solution  to  the  S-REAPER  problem  (1.6) 
at  least  99%  of  the  time. 

Fact  1.2  follows  from  Theorem  B.l.  We  highlight  this  ad  hoc  result  to  underscore  the  point  that 
it  takes  only  a  few  good  observations  for  (1.6)  to  identify  inliers  that  fall  on  a  line.  Indeed,  Fact  1.2 
is  valid  even  in  dimension  D  =  1037  with  Nout  =  2  x  10  . 


1.7.2.  Is  it  Hard  to  Find  a  Needle  in  a  Haystack?  Our  low-dimensional  intuition  suggests  that  it 
might  be  difficult  to  identify  the  subspace  in  the  Haystack  Model,  but  there  are  several  algorithms 
that  can  complete  the  task.  For  instance,  the  EM  algorithm  for  normal  mixtures  has  no  trouble 
with  this  example  [Recl2].  In  contrast,  for  natural  data,  we  have  found  that  the  S-REAPER  method 
often  outperforms  other  methods  by  a  reasonable  margin. 

Let  us  emphasize  that  we  do  not  intend  for  the  Haystack  Model  to  describe  natural  data.  We 
have  introduced  this  formulation  as  a  way  to  probe  the  limitations  of  the  reaper  and  S-REAPER 
problems  in  experiments.  The  Haystack  Model  also  gives  us  some  theoretical  insight  on  their  exact 
recovery  behavior.  We  believe  that  Theorem  1.1  is  compelling  in  part  because  the  optimization 
problems  are  derived  without  knowledge  of  the  model.  It  would  be  valuable  to  analyze  reaper 
and  S-REAPER  with  more  sophisticated  data  models,  but  we  leave  this  task  for  future  work. 
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1.8.  Summary  of  Contributions.  This  work  partakes  in  a  larger  research  vision:  Given  a  difficult 
nonconvex  optimization  problem,  it  is  often  more  effective  to  solve  a  convex  variant  than  to  accept 
a  local  minimizer  of  the  original  problem. 

We  believe  that  the  main  point  of  interest  in  this  paper  is  our  application  of  convex  optimization 
to  solve  a  problem  involving  subspaces.  There  are  two  key  observations  here.  We  parameterize 
subspaces  by  orthoprojectors,  and  then  we  replace  the  set  of  orthoprojectors  with  its  convex  hull 
{P  :  0  P  ^  I}.  Neither  of  these  ideas  is  new,  but  this  research  provides  clear  evidence  that  they 
can  be  used  to  address  important  geometric  questions  in  data  analysis. 

To  support  the  claim  that  our  approach  works,  we  develop  a  deterministic  analysis  that  describes 
when  reaper  and  S-REAPER  can  recover  a  subspace  exactly.  This  analysis  yields  summary  statistics 
that  predict  when  the  optimization  problems  succeed.  For  data  that  conform  to  the  Haystack 
Model  on  page  6,  we  can  evaluate  these  summary  statistics  to  obtain  the  exact  recovery  guarantees 
of  Theorem  1.1.  See  Section  3  for  more  information. 

In  Section  4,  we  present  an  efficient  iterative  algorithm  for  solving  the  semidefinite  program  (1.4). 
The  algorithm  typically  requires  a  modest  number  of  spectral  computations,  so  it  scales  to  relatively 
large  problem  instances. 

In  Section  5,  we  use  our  algorithm  to  perform  some  experiments  that  describe  the  behavior  of 
reaper  and  S-REAPER.  First,  we  study  synthetic  data  problems  to  understand  the  limitations  of 
S-REAPER;  these  experiments  indicate  that  Theorem  1.1  is  qualitatively  correct.  We  also  describe 
some  stylized  applications  involving  natural  data;  these  results  show  that  S-REAPER  dominates  other 
robust  PCA  techniques  in  several  situations.  Furthermore,  the  linear  models  we  obtain  appear  to 
generalize  better  than  the  robust  models  obtained  with  some  other  methods. 

We  describe  a  variety  of  future  research  directions  in  Section  6.  In  particular,  we  propose  some 
alternative  formulations  of  the  reaper  problem.  We  also  suggest  some  ways  to  extend  our  approach 
to  manifold  learning  and  hybrid  linear  modeling. 

The  technical  details  of  our  work  appear  in  the  appendices. 

2.  Previous  Work 

Robust  linear  modeling  has  been  an  active  subject  of  research  for  over  three  decades.  Although 
many  classical  approaches  have  strong  robustness  properties  in  theory ,  the  proposals  usually  involve 
intractable  computational  problems.  More  recently,  researchers  have  developed  several  techniques, 
based  on  convex  optimization,  that  are  computationally  efficient  and  admit  some  theoretical  guar¬ 
antees.  In  this  section,  we  summarize  classical  and  contemporary  work  on  robust  linear  modeling, 
with  a  focus  on  the  numerical  aspects.  We  recommend  the  books  [HR09,  MMY06,  RL87]  for  a 
comprehensive  discussion  of  robust  statistics. 

2.1.  Classical  Strategies  for  Robust  Linear  Modeling.  We  begin  with  an  overview  of  the 
major  techniques  that  have  been  proposed  in  the  statistics  literature.  The  theoretical  contributions 
in  this  area  focus  on  breakdown  points  and  influence  functions  of  estimators.  Researchers  tend  to 
devote  less  attention  to  the  computational  challenges  inherent  in  these  formulations. 

2.1.1.  Robust  Combination  of  Residuals.  Historically,  one  of  the  earliest  approaches  to  linear  re¬ 
gression  is  to  minimize  the  sum  of  (nonorthogonal)  residuals.  This  is  the  principle  of  least  absolute 
deviations  (LAD).  Early  proponents  of  this  idea  include  Galileo,  Boscovich,  Laplace,  and  Edge- 
worth.  See  [Har74a,  Har74b,  Dod87]  for  some  historical  discussion.  It  appears  that  orthogonal 
regression  with  LAD  was  first  considered  in  the  late  1980s  [OW85,  SW87,  Nyq88];  the  extension 
from  orthogonal  regression  to  PCA  seems  to  be  even  more  recent  [WatOl,  DZHZ06].  LAD  has  also 
been  considered  as  a  method  for  hybrid  linear  modeling  in  [ZSL09,  LZ11].  We  are  not  aware  of 
tractable  algorithm  for  these  formulations. 

There  are  many  other  robust  methods  for  combining  residuals  aside  from  LAD.  An  approach 
that  has  received  wide  attention  is  to  minimize  the  median  of  the  squared  residuals  [Rou84,  R.L87]. 
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Other  methods  appear  in  the  books  [HR09,  MMY06].  These  formulations  are  generally  not  com¬ 
putationally  tractable. 

2.1.2.  Robust  Estimation  of  Covariance  Matrix.  Another  standard  technique  for  robust  PCA  is  to 
form  a  robust  estimate  of  the  covariance  matrix  of  the  data  [Mar76,  HR09,  MMY06,  DGK81,  Dav87, 
XY95,  CHOO].  The  classical  robust  estimator  of  covariance  is  based  on  the  maximum  likelihood 
principle,  but  there  are  many  other  approaches,  including  S-estimators,  the  minimum  covariance 
determinant  (MCD),  the  minimum  volume  ellipsoid  (MVE),  and  the  Stahel-  Donoho  estimator. 
See  [MMY06,  Sec.  6]  for  a  review.  We  are  not  aware  of  any  scalable  algorithm  for  implementing 
these  methods  that  offers  a  guarantee  of  correctness. 

2.1.3.  Projection  Pursuit  PCA.  Projection  pursuit  (often  abbreviated  PP-PCA)  is  a  procedure 
that  constructs  principal  components  one  at  a  time  by  finding  a  direction  that  maximizes  a  ro¬ 
bust  measure  of  scale,  removing  the  component  of  the  data  in  this  direction,  and  repeating  the 
process.  The  initial  proposal  appears  in  [HR09,  1st  ed.],  and  it  has  been  explored  by  many  other 
authors  [LC85,  Amm93,  CFO07,  Kwa08,  WTH09].  We  are  aware  of  only  one  formulation  that 
provably  (approximately)  maximizes  the  robust  measure  of  scale  at  each  iteration  [MT11],  but 
there  are  no  overall  guarantees  for  this  algorithm. 

2.1.4.  Screening  for  Outliers.  Another  common  approach  is  to  remove  possible  outliers  and  then 
estimate  the  underlying  subspace  by  PCA  [CW82,  TB01,  TB03].  The  classical  methods  offer  very 
limited  guarantees.  There  are  some  recent  algorithms  that  are  provably  correct  [Bru09,  XCM10] 
under  model  assumptions. 

2.1.5.  RANSAC.  The  randomized  sample  consensus  (RANSAC)  method  is  a  randomized  iterative 
procedure  for  fitting  models  to  noisy  data  consisting  of  inkers  and  outliers  [FB81] .  Under  some 
assumptions,  RANSAC  will  eventually  identify  a  linear  model  for  the  inkers,  but  there  are  no 
guarantees  on  the  number  of  iterations  required. 

2.1.6.  Spherical  PCA.  A  useful  method  for  fitting  a  robust  linear  model  is  to  center  the  data 
robustly,  project  it  onto  a  sphere,  and  then  apply  standard  PCA.  This  approach  is  due  to  [LMS+99]. 
Maronna  et  al.  [MMY06]  recommend  it  as  a  preferred  method  for  robust  PCA.  The  technique  is 
computationally  practical,  but  it  has  limited  theoretical  guarantees. 

2.2.  Approaches  Based  on  Convex  Optimization.  Recently,  researchers  have  started  to  de¬ 
velop  effective  techniques  for  robust  linear  modeling  that  are  based  on  convex  optimization.  These 
formulations  invite  a  variety  of  tractable  algorithms,  and  they  have  theoretical  guarantees  under 
appropriate  model  assumptions. 

2.2.1.  Deconvolution  Methods.  One  class  of  techniques  for  robust  linear  modeling  is  based  on  split¬ 
ting  a  data  matrix  into  a  low-rank  model  plus  a  corruption.  The  first  approach  of  this  form  is 
due  to  Chandrasekaran  et  al.  [CSPW11].  Given  an  observed  matrix  X,  they  propose  to  solve  the 
semidefinite  problem 

minimize  ||P||Si  +  7 11^1^  subject  to  X  =  P  +  C.  (2-1) 

Minimizing  the  Schatten  1-norm  ||-||g  promotes  low  rank,  while  minimizing  the  vector  i\  norm 
promotes  sparsity.  The  regularization  parameter  7  negotiates  a  tradeoff  between  the  two  goals. 
Candes  et  al.  [CLMW11]  study  the  performance  of  (2.1)  for  robust  linear  modeling  in  the  setting 
where  individual  entries  of  the  matrix  X  are  subject  to  error. 

A  related  proposal  is  due  to  Xu  et  al.  [XCSIOa,  XCSIOb]  and  independently  to  McCoy  and 
Tropp  [MTll].  These  authors  recommend  solving  the  decomposition  problem 

minimize  ||P||S  +  7  ||C||^2  subject  to  X  =  P  +  C . 
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The  norm  ||-||*  2  returns  the  sum  of  Euclidean  norms  of  the  columns  of  its  argument.  This  formu¬ 
lation  is  appropriate  for  inlier-outlier  data  models,  where  entire  columns  of  the  data  matrix  may 
be  corrupted. 

Both  (2.1)  and  (2.2)  offer  some  theoretical  guarantees  under  appropriate  model  assumptions. 
The  most  common  algorithmic  framework  for  these  problems  depends  on  the  alternating  direction 
method  of  multipliers  (ADMM),  which  is  a  type  of  augmented  Lagrangian  scheme.  These  algorithms 
converge  slowly,  so  it  takes  excessive  computation  to  obtain  a  high-accuracy  solution. 

2.2.2.  Precedents  for  the  reaper  Problem.  The  reaper  problem  (1.4)  is  a  semidefinite  relaxation 
of  the  l\  orthogonal  distance  problem  (1.3).  Our  work  refines  a  weaker  relaxation  of  (1.3)  proposed 
by  Zhang  and  Lerman: 

minimize  ^  ||x  —  Px\\  subject  to  trP  =  l. 

Some  of  the  ideas  in  the  theoretical  analysis  of  reaper  are  drawn  from  [LZ10,  ZL11] .  The  IRLS 
algorithm  for  (1.4)  and  the  convergence  analysis  that  we  present  in  Section  4  are  also  based  on  the 
earlier  work. 

The  idea  of  relaxing  a  difficult  nonconvex  program  to  obtain  a  convex  problem  is  well  established 
in  the  literature  on  combinatorial  optimization.  Research  on  linear  programming  relaxations  is 
summarized  in  [Vaz03].  Some  significant  works  on  semidefinite  relaxation  include  [LS91,  GW95]. 

3.  Deterministic  Conditions  for  Exact  Recovery  of  a  Subspace 

In  this  section,  we  develop  a  deterministic  model  for  a  dataset  that  consists  of  inliers  located 
in  a  fixed  subspace  and  outliers  that  may  appear  anywhere  in  the  ambient  space.  We  introduce 
summary  statistics  that  allow  us  to  study  when  REAPER  or  S-REAPER  can  recover  exactly  the 
subspace  containing  the  inliers.  The  result  for  the  Haystack  Model,  Theorem  1.1,  follows  when  we 
estimate  the  values  for  these  summary  statistics. 

3.1.  A  Deterministic  Model  for  Studying  Exact  Recovery  of  a  Subspace.  We  begin 
with  a  deterministic  model  for  a  dataset  that  consists  of  low- dimensional  inliers  mixed  with  high¬ 
dimensional  outliers. 


Table  3.1.  The  In  &  Out  Model.  A  deterministic 
model  for  data  with  linear  structure  that  is  contami¬ 
nated  with  outliers. 


D 

Dimension  of  the  ambient  space 

L 

A  proper  d-dimensional  subspace  of  M.D 

Ain 

Number  of  inliers 

AOUt 

Number  of  outliers 

Ain 

Set  of  A^  inliers,  at  arbitrary  locations  in  L 

AOUt 

Set  of  Aout  outliers,  at  arbitrary  locations  in  R13  \  L 

X 

Set  Ain  U  Aout  containing  all  the  observations 

The  key  point  about  this  model  is  that  the  inliers  are  all  contained  within  the  subspace  L,  so  it 
is  reasonable  to  investigate  when  we  can  identify  L  exactly.  For  notational  convenience,  we  assume 
that  no  observation  is  repeated,  but  this  assumption  plays  no  role  in  our  analysis.  The  Haystack 
Model  on  page  6  almost  surely  produces  a  dataset  that  conforms  to  the  In  &  Out  Model. 
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3.2.  Summary  Parameters  for  the  In  &:  Out  Model.  In  general,  we  cannot  hope  to  identify 
the  subspace  L  that  appears  in  the  In  &  Out  Model  without  further  assumptions  on  the  data. 
This  section  describes  some  summary  statistics  that  allow  us  to  check  when  reaper  or  S-REAPER 
succeeds  in  computing  L.  These  quantities  play  a  role  analogous  to  the  coherence  statistic  in  the 
sparsity  literature  [DH01]. 

First,  let  us  discuss  what  property  of  the  inliers  might  allow  us  to  determine  the  subspace  L  that 
contains  them.  Roughly,  we  need  the  inlying  observations  to  permeate  L.  To  quantify  this  idea, 
we  define  the  permeance  statistic  (with  respect  to  L)  by  the  formula 

V{L)  :=  inf  l(w>  x)\ ■  (3-1) 

u&L  “ 

||ix||=l 

Similarly,  we  define  the  spherical  permeance  statistic  as 

V{L)  :=  inf  JZ  x)\  •  (3-2) 

ueL 
||lt||  =  l 


When  V(L)  =  0  (equivalently,  V(L)  =  0),  there  is  a  vector  in  the  subspace  L  that  is  orthogonal  to 
all  the  inliers.  In  this  case,  the  data  do  not  contain  enough  information  for  us  to  recover  L.  On  the 
other  hand,  when  one  of  the  permeance  statistics  is  large,  the  inliers  corroborate  each  direction  in 
the  subspace.  In  this  case,  we  expect  our  methods  for  linear  modeling  to  be  more  effective. 

To  ensure  recovery  of  L,  we  must  also  be  certain  that  the  outliers  do  not  exhibit  linear  structure. 
Otherwise,  there  is  no  way  to  decide  whether  a  subspace  captures  linear  structure  from  the  inliers 
or  the  outliers.  Indeed,  in  the  extreme  case  where  the  outliers  are  arranged  on  a  line,  any  sensible 
estimation  procedure  would  have  to  select  a  subspace  that  contains  this  line. 

Let  us  introduce  a  quantity  that  measures  the  linear  dependency  of  the  outliers  within  a  given 
subspace.  The  linear  structure  statistic  is  defined  for  each  subspace  M  C  MD  by  the  formula 

S(M)2  :=  sup  Y,  l<u»  *>|2  =  l|nMXout||2,.  (3.3) 

U£M  p  y 
11-1*11=1  *fc'tout 


where  Wout  isaDx  7Vout  matrix  whose  columns  are  the  outliers,  arranged  in  fixed  order.  Similarly, 
we  can  define  the  spherical  linear  structure  statistic 


<S(M)2  := 


sup 

ueM 

11*11=1 


E 

XG^out 


n  Mx 


=  WUmX, 


out  ||  > 


(3.4) 


where  IIj\fWout  isaDx  -/Vout  matrix  whose  columns  are  the  vectors  II  mx  f°r  each  x  £  Xout-  The 
linear  structure  statistics  tend  to  be  large  when  the  outliers  are  nearly  collinear  (after  projection 
onto  M),  while  the  statistics  tend  to  be  small  when  the  outliers  are  close  to  orthogonal. 


3.3.  Conditions  for  Exact  Recovery  of  a  Subspace.  We  are  now  prepared  to  state  conditions 
when  either  reaper  or  s-reaper  recovers  the  subspace  L  containing  the  inliers  exactly. 

Theorem  3.1.  Let  L  be  a  d-dimensional  subspace  o/MD,  and  assume  that  X  is  a  dataset  that 
follows  the  In  &  Out  Model  on  page  10. 

(i)  Suppose  that 

V{L)  >  ^2d-S{LL)  -S(Md).  (3.5) 

Then  II l  is  the  unique  minimizer  of  the  reaper  problem  (1.4). 

(ii)  Suppose  that 

V{L)  >  y/Td-S^)  -S(Rd). 

Then  11^  is  the  unique  minimizer  of  the  S-reaper  problem  (1.6). 


(3.6) 
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The  permeance  statistics  V  andV  are  defined  in  (3.1)  and  (3.2),  while  the  linear  structure  statistics 
S  and  S  are  defined  in  (3.3)  and  (3.4). 

Theorem  3.1  shows  that  the  reaper  (resp.  S-REAPER)  problem  can  find  the  subspace  L  con¬ 
taining  the  inliers  whenever  the  (spherical)  permeance  statistic  is  sufficiently  large  as  compared 
with  the  (spherical)  linear  structure  statistic.  Be  aware  that  the  inequalities  (3.5)  and  (3.6)  are 
not  comparable;  neither  one  implies  the  other.  The  main  difference  is  due  to  the  fact  that  the 
condition  (3.6)  for  S-REAPER  is  invariant  to  the  scalings  of  the  data  points. 

The  proof  of  Theorem  3.1  appears  in  Appendix  A.l.  The  main  idea  in  the  argument  is  to  verify 
that,  under  the  appropriate  sufficient  condition,  the  objective  function  of  reaper  or  S-REAPER 
increases  if  we  perturb  the  decision  variable  P  away  from  11^. 


3.4.  Exact  Recovery  for  the  Haystack  Model.  The  conditions  in  Theorem  3.1  hold  in  highly 
nontrivial — even  surprising — regimes.  In  this  section,  we  make  some  heuristic  calculations  to  indi¬ 
cate  how  the  spherical  summary  statistics  behave  when  X  is  drawn  from  the  Haystack  Model  on 
page  6.  We  prove  rigorous  results,  including  Theorem  1.1,  in  Appendix  B. 

First,  let  us  consider  the  spherical  permeance  statistic  V(L).  Fix  a  unit  vector  u  E  L,  and 
suppose  that  x  is  a  random  unit  vector  in  L.  The  random  variable  (u.  x)  is  approximately  normal 
with  mean  zero  and  variance  d-1,  so  we  have  E  |(w,  x)\  ~  ^2/ (ltd).  It  follows  that 


'p(l)  ~  X  x)\ 

a?GAin 


[2  N-m 
Vvr'  y/jf 


We  have  omitted  the  infimum  over  u  from  (3.1)  because  it  does  not  make  a  substantial  difference. 
See  Lemma  B.3  for  a  rigorous  computation. 

Next,  we  examine  the  spherical  linear  structure  statistic  S(M)  =  1 1 n^../ Xout|| .  This  quantity  is 
just  the  norm  of  a  matrix  with  -/Vout  columns  that  are  i.i.d.  uniform  on  the  unit  sphere  in  M.  For 
many  purposes,  this  type  of  matrix  behaves  like  a  dirn(M)  x  IVout  Gaussian  matrix  whose  entries 
have  variance  dim(M)-1.  Standard  bounds  for  the  norm  of  a  Gaussian  matrix  [DS01,  Thm.  2.13] 
give  the  estimate 


S(M) 


N, 


out 


dim  (M) 


+  1. 


Therefore,  we  can  estimate  the  product 


S(LX)  -S(MD) 


A 


out 


D 


+  1 


A 


out 


D-d 


+  1 


< 


N, 


out 


D-d 


+  1 


See  Lemma  B.5  for  more  careful  calculations. 

To  conclude,  when  the  data  X  is  drawn  from  the  Haystack  Model,  the  sufficient  condition  (3.6) 
for  S-REAPER  to  succeed  becomes 


Nin 

d 


> 


IT 


Aout 

D-d 


2 


+  1 


(3.7) 


Assume  that  D  —  d  is  comparable  with  D.  Then  the  heuristic  (3.7)  indicates  that  the  inlier  sampling 
ratio  pin  =  N[n/d  should  increase  linearly  with  the  the  outlier  sampling  ratio  pout  =  Nout/D.  This 
scaling  matches  the  rigorous  bound  in  Theorem  1.1. 


4.  An  Iterative  Reweighted  Least-Squares  Algorithm  for  reaper 

In  this  section,  we  present  a  numerical  algorithm  for  solving  the  reaper  problem  (1.4).  Of 
course,  we  can  also  solve  S-REAPER  by  spherizing  the  data  before  applying  this  algorithm.  Our 
approach  is  based  on  the  iterative  reweighted  least  squares  (IRLS)  framework  [Bjo96,  Sec.  4.5.2], 
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At  each  step,  we  solve  a  least-squares  problem  with  an  adapted  set  of  scaling  parameters.  The 
subproblem  has  a  closed-form  solution  that  we  can  obtain  by  computing  an  eigenvalue  (or  singular 
value)  decomposition.  Empirically,  the  IRLS  approach  exhibits  linear  convergence,  so  it  can  achieve 
high  accuracy  without  an  exorbitant  number  of  iterations.  It  is  possible  to  apply  other  types  of 
optimization  algorithms,  such  as  interior-point  methods,  to  solve  reaper,  but  we  are  not  aware  of 
an  alternative  approach  that  scales  to  large  datasets. 

4.1.  A  Weighted  Least-Squares  Problem.  The  reaper  problem  (1.4)  cannot  be  solved  in 
closed  form,  but  there  is  a  closely  related  problem  whose  minimizer  can  be  computed  efficiently. 
We  use  this  mathematical  program  and  its  solution  to  build  an  algorithm  for  solving  reaper. 

For  each  x  6  A,  let  px  be  a  nonnegative  weight  parameter.  Consider  the  weighted  least-squares 
problem  with  the  same  constraints  as  (1.4): 

minimize  Px  \\x  —  Px\\2  subject  to  0  P  =4  I  and  tr  P  =  d.  (4-1) 


Algorithm  4.1  Solving  the  weighted  least-squares  problem  (4.1) 

Input: 

•  A  set  X  of  observations  in  WD 

•  A  nonnegative  weight  Px  for  each  x  G  X 

•  The  parameter  d  in  (4.1),  where  0  <  d  <  D 
Output: 

•  A  D  x  D  matrix  P*  that  solves  (4.1) 

Procedure: 

1  Form  the  D  x  D  weighted  covariance  matrix 

C  ^  Px  xx 1 

x&X 

2  Compute  an  eigenvalue  decomposition  C  =  U  ■  rliag(Ai, . . . ,  Ad)  •  Ul  with  eigenvalues  in 
nonincreasing  order:  Ai>--->Ad>0 

3  if  A^j+1  =  0  then 

a  for  i  =  1 , . . . ,  D  do 

f  1,  i  <  L d\ 

Vi  < —  <  d  —  |_dj ,  i  =  \d\  +  1 
I  0,  otherwise 


b  for  i  =  1 , . . . ,  D  do 


Vi  <— 


A  i>0 
A  i<6 


5  return  P*  :=  U  ■  diag(iq, . . . ,  pd)  •  Ul 
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Algorithm  4.1  describes  a  computational  procedure  for  solving  (4.1).  The  basic  idea  is  to  compute 
a  weighted  covariance  matrix  C  from  the  data.  We  use  spectral  shrinkage  to  scale  the  eigenvalues 
of  C  to  the  range  [0, 1].  The  amount  of  shrinkage  is  determined  by  a  water-filling  calculation,  which 
ensures  that  the  output  matrix  the  correct  trace.  In  Appendix  C.l,  we  present  a  more  mathematical 
formulation  of  Algorithm  4.1,  and  we  use  this  statement  to  establish  correctness. 

4.1.1.  Discussion  and  Computational  Costs.  The  bulk  of  the  computation  in  Algorithm  4.1  takes 
place  during  the  spectral  calculation  in  Steps  1  and  2.  We  need  0(ND 2)  arithmetic  operations  to 
form  the  weighted  covariance  matrix,  and  the  spectral  calculation  requires  0{Di).  The  remaining 
steps  of  the  algorithm  have  order  O(D). 

We  have  presented  Algorithm  4.1  in  the  most  direct  way  possible.  In  practice,  it  is  usually 
more  efficient  to  form  a  D  x  N  matrix  W  with  columns  y//3^a:  for  x  £  A,  to  compute  a  thin  SVD 
W  =  UYiV1,  and  to  set  A  =  X2.  This  approach  is  also  more  stable.  In  some  situations,  it  i  possible 
to  accelerate  the  spectral  calculations  using  randomized  dimension  reduction,  as  in  [HMT11], 

4.2.  Iterative  Reweighted  Least-Squares  Algorithm.  We  are  prepared  to  develop  an  algo¬ 
rithm  for  solving  the  reaper  problem  (1.4).  Let  us  begin  with  the  intuition.  Suppose  that  P*  solves 
the  weighted  least-squares  problem  (4.1)  where  (3X  ~  \\x  —  P*x\\~L  for  each  x  6  X .  Examining  the 
objective  function  of  (4.1),  we  see  that 

^2  fix  II*  -  P**||2  «  llx  -  ^**11  • 

Therefore,  it  seems  plausible  that  P*  is  also  close  to  the  minimizer  of  the  reaper  problem  (1.4). 
This  heuristic  is  classical,  and  it  remains  valid  in  our  setting. 


Algorithm  4.2  IRLS  algorithm  for  solving  the  reaper  problem  (1.4) 
Input: 

•  A  set  X  of  observations  in  WD 

•  The  dimension  parameter  d  in  (1.4),  where  0  <  d  <  D 

•  A  regularization  parameter  6  >  0 

•  A  stopping  tolerance  e  >  0 
Output: 


•  AD  x  D  matrix  P*  that  satisfies  0  =<;  P*  ^  I  and  tr  P*  =  d 
Procedure: 

1  Initialize  the  variables: 

a  Set  the  iteration  counter  k  <—  0 
b  Set  the  initial  error  cA0)  < — boo 
c  Set  the  weight  j3x  <—  1  for  each  x  €  X 


2 


3 


do 

a 

b 

c 

d 


Increment  k  4—  k  +  1 

Use  Algorithm  4.1  to  compute  an  optimal  point  plO  Qf  (4.1)  with  weights  /3X 
Let  be  the  optimal  value  of  (4.1)  at  P ^ 

Update  the  weights: 


fix 


1 

max  {<5,  II*  —  P(fc)*||  } 


for  each  *  6  X 


until  the  objective  fails  to  decrease:  cAfc)  >  a(fc  P  —  e 


Return  P*  =  P ^ 
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This  observation  motivates  an  iterative  procedure  for  solving  (1.4).  Let  5  be  a  positive  regular¬ 
ization  parameter.  Set  the  iteration  counter  k  0,  and  set  the  weight  (3X  1  for  each  x  £  X .  We 

solve  (4.1)  with  the  weights  f3x  to  obtain  a  matrix  P^k\  and  then  we  update  the  weights  according 
to  the  formula 


Px 


1 


for  each  x  £  X . 


max  {5,  ||®  —  P(k)x ||  ) 

In  other  words,  we  emphasize  the  observations  that  are  explained  well  by  the  current  model.  The 
presence  of  the  regularization  5  ensures  that  no  single  point  can  gain  undue  influence.  We  increment 
k,  and  we  repeat  the  process  until  it  has  converged. 

We  summarize  the  computational  procedure  in  Algorithm  4.2.  The  following  result  shows  that 
Algorithm  4.2  is  guaranteed  to  converge  to  a  point  whose  value  is  close  to  the  optimal  value  of  the 
REAPER  problem  (1.4). 


Theorem  4.1  (Convergence  of  IRLS).  Assume  that  the  set  X  of  observations  does  not  lie  in  the 
union  of  two  strict  subspaces  ofWD.  Then  the  iterates  of  Algorithm  f.2  with  £  =  0  converge  to  a 
point  Ps  that  satisfies  the  constraints  of  the  reaper  problem  (1.4).  Moreover,  the  objective  value 
at  Ps  satisfies  the  bound 

Y  II*  -  ^5*11  -  Y  Wx  ~  P*XW  -  \X\  ’ 

x£X  x&X  1 

where  P*  is  an  optimal  point  of  reaper. 


The  proof  of  Theorem  4.1  is  similar  to  established  convergence  arguments  [ZL11,  Thms.  6  and  8], 
which  follow  the  schema  in  [CM99,  VE80].  See  Appendix  C  for  a  summary  of  the  proof.  It  can 
also  be  shown  that  the  point  Pg  is  close  to  the  optimal  point  P*  in  Frobenius  norm. 


4.2.1.  Discussion  and  Computational  Costs.  The  bulk  of  the  computational  cost  in  Algorithm  4.2 
is  due  the  solution  of  the  subproblem  in  Step  2b.  Therefore,  each  iteration  costs  0(ND2).  The 
algorithm  converges  linearly  in  practice,  so  we  need  0(\og(l/rj))  iterations  to  achieve  an  error  of  r/. 

Figure  4.1  shows  that,  empirically,  Algorithm  4.2  exhibits  linear  convergence.  In  this  experiment, 
we  have  drawn  the  data  from  the  Haystack  Model  on  page  6  with  ambient  dimension  D  =  100  and 
Aout  =  200  outliers.  The  curves  mark  the  performance  for  several  choices  of  the  model  dimension  d 
and  the  inlier  sampling  ratio  pin  =  N-m/d.  For  this  plot,  we  run  Algorithm  4.2  with  the  regularization 
parameter  5  =  10~10  and  the  error  tolerance  e  =  10-15  to  obtain  a  sequence  {p(fc)}  of  iterates.  We 
compare  the  computed  iterates  with  an  optimal  point  P*  of  the  reaper  problem  (1.4)  obtained  by 
solving  reaper  with  the  Matlab  package  CVX  [GB08,  GB10]  at  the  highest-precision  setting.  The 
error  is  measured  in  Frobenius  norm. 

For  synthetic  data,  the  number  of  iterations  required  for  Algorithm  4.2  seems  to  depend  on 
the  difficulty  of  the  problem  instance.  Indeed,  it  may  take  as  many  as  200  iterations  for  the 
method  to  converge  on  challenging  examples.  In  experiments  with  natural  data,  we  usually  obtain 
good  performance  after  20  iterations  or  so.  In  practice,  Algorithm  4.2  is  also  much  faster  than 
algorithms  [XCSIOb,  MT11]  for  solving  the  low-leverage  decomposition  problem  (2.2). 

The  stopping  criterion  in  Algorithm  4.2  is  motivated  by  the  fact  that  the  objective  value  must 
decrease  in  each  iteration.  This  result  is  a  consequence  of  the  proof  of  Theorem  4.1;  see  (C.7). 
Taking  e  on  the  order  of  machine  precision  ensures  that  the  algorithm  terminates  when  the  iterates 
are  dominated  by  numerical  errors.  In  practice,  we  achieve  very  precise  results  when  e  =  10“ 15  or 
even  e  =  0.  In  many  applications,  this  degree  of  care  is  excessive,  and  we  can  obtain  a  reasonable 
solution  for  much  larger  values  of  e. 

Theorem  4.1  requires  a  technical  condition,  namely  that  the  observations  do  not  lie  in  the  union 
of  two  strict  subspaces.  This  condition  cannot  hold  unless  we  have  at  least  2D  —  1  observations. 
In  practice,  we  find  that  this  restriction  is  unnecessary  for  the  algorithm  to  converge.  It  seems  to 
be  an  artifact  of  the  proof  technique. 
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Figure  4.1.  Convergence  of  IRLS  to  an  optimal  point.  The  data  are  drawn  from 
the  Haystack  Model  on  page  6  with  ambient  dimension  D  =  100  and  IVout  =  200 
outliers.  Each  curve  is  associated  with  a  particular  choice  of  model  dimension  d 
and  inlier  sampling  ratio  pin  =  N-in/d.  We  use  Algorithm  4.2  to  compute  a  se¬ 
quence  {plfc)}  of  iterates,  which  we  compare  to  an  optimal  point  P*  of  the  REAPER 
problem  (1.4).  See  the  text  in  Section  4.2  for  more  details  of  the  experiment. 

5.  Experiments 

In  this  section,  we  present  some  numerical  experiments  involving  reaper  and  s-reaper.  To 
probe  the  limits  of  these  optimization  problems  for  fitting  a  linear  model  in  the  presence  of  outliers, 
we  study  how  they  behave  on  some  simple  random  data  models.  To  gauge  their  performance  in 
more  realistic  settings,  we  consider  some  stylized  problems  involving  natural  data. 

5.1.  Experimental  Setup.  To  solve  the  reaper  problem  (1.4)  and  the  S-reaper  problem  (1.6), 
we  use  the  IRLS  method,  Algorithm  4.2.  We  set  the  regularization  parameter  S  =  10-10  and  the 
stopping  tolerance  e  =  10-15. 

In  all  the  cases  considered  here,  the  dimension  parameter  d  is  a  positive  integer.  Unless  we  state 
otherwise,  we  postprocess  the  computed  optimal  point  P*  of  reaper  or  S-reaper  to  obtain  a 
d-dimensional  linear  model.  When  P*  has  rank  greater  than  d,  we  construct  a  d-dimensional  linear 
model  by  using  the  span  of  the  d  dominant  eigenvectors  of  P+.  Our  goal  is  to  use  a  consistent 
methodology  that  involves  no  parameter  tweaking.  Other  truncation  strategies  are  also  possible, 
as  we  discuss  in  Section  6.1.6. 

5.2.  Comparisons.  By  now,  there  are  a  huge  number  of  proposals  for  robust  linear  modeling,  so 
we  have  limited  our  attention  to  methods  that  are  computationally  tractable.  That  is,  we  consider 
only  formulations  that  have  a  polynomial-time  algorithm  for  constructing  a  global  solution  (up 
to  some  tolerance).  We  do  not  discuss  techniques  that  involve  Monte  Carlo  simulation,  nonlinear 
programming,  etc.  because  the  success  of  these  approaches  depends  largely  on  parameter  settings 
and  providence.  As  a  consequence,  it  is  hard  to  evaluate  their  behavior  in  a  consistent  way. 
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We  consider  two  standard  approaches,  PCA  [Jol02]  and  spherical  PCA  [LMS+99].  Spherical  PCA 
rescales  each  observation  so  it  lies  on  the  Euclidean  sphere,  and  then  it  applies  standard  PCA.  In 
a  recent  book,  Maronna  et  al.  [MMY06]  recommend  spherical  PCA  as  the  most  reliable  classical 
algorithm,  which  provides  a  second  justification  for  omitting  many  other  established  methods. 

We  also  consider  a  more  recent  proposal  [XCSIOa,  XCSIOb,  MT11],  which  is  called  low-leverage 
decomposition  (LLD)  or  outlier  pursuit.  This  method  decomposes  the  D  x  N  matrix  X  of  obser¬ 
vations  by  solving  the  optimization  problem 

minimize  ||P||S  +  7  ||C||*^2  subject  to  X  =  P  +  C  (5.1) 

where  ||j|s  is  the  Schatten  1-norm  and  ||  •  II  1—^2  returns  the  sum  of  Euclidean  norms  of  the  column. 
The  idea  is  that  the  optimizer  (P+,C+)  will  consist  of  a  low-rank  model  -P*  for  the  data  along 
with  a  column-sparse  matrix  (7*  that  identifies  the  outliers.  We  always  use  the  parameter  choice 
7  =  0.8 ^Jd/N,  which  seems  to  be  effective  in  practice. 

We  do  not  make  comparisons  with  the  rank-sparsity  decomposition  [CSPW11],  which  has  also 
been  considered  for  robust  linear  modeling  in  [CLMW11].  It  is  not  effective  for  the  problems  that 
we  consider  here. 

5.3.  Phase  Transitions  for  Exact  Recovery  of  Subspaces.  This  section  studies  the  limits  of 
the  S-REAPER  problem  by  applying  it  to  data  drawn  from  the  Haystack  Model  on  page  6.  The  goal 
is  to  understand  when  S-REAPER  can  recover  a  low-dimensional  subspace  exactly,  even  when  the 
dataset  contains  many  unstructured  outliers.  Theorem  1.1  tells  us  that  the  inlier  sampling  ratio 
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Figure  5.1.  Exact  recovery  of  a  needle  in  a  haystack  via  s-reaper.  The  data  is 
drawn  from  the  Haystack  Model  on  page  6  with  subspace  dimension  d  =  1.  The 
ambient  dimension  D  =  3  [left]  and  D  =  100  [right].  The  blue  curves  show  the 
theoretical  50%  success  threshold  given  by  (B.lc).  The  red  circles  mark  the  empirical 
50%  success  threshold  for  each  value  of  pm,  and  the  yellow  lines  indicate  the  least- 
squares  fit.  The  trends  are  pin  =  1.41/?out  +  0.59  [left]  and  pm  =  1.53pout  +  2.68 
[right].  See  Section  5.3  for  details. 
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Pin  =  Nin/d  and  the  outlier  sampling  ratio  pou t  =  Nout/D  are  relevant  to  exact  recovery.  Indeed, 
we  expect  to  see  a  linear  relationship  between  these  two  parameters. 

In  these  experiments,  we  fix  the  ambient  dimension  D  and  the  dimension  d  of  the  subspace  L. 
The  number  N-m  of  inliers  and  the  number  Nout  of  outliers  vary.  Note  that  the  specific  choice  of  the 
subspace  L  is  immaterial  because  the  model  is  rotationally  invariant.  The  variance  parameters  can 
be  equated  (ofn  =  =  1)  because  they  do  not  play  any  role  in  the  performance  of  S-REAPER. 

We  want  to  determine  when  the  solution  P*  to  the  s-reaper  problem  equals  the  projector  IIx, 
onto  the  model  subspace,  so  we  declare  the  experiment  a  success  when  the  spectral-norm  error 
||p*  -  nL||  10  For  each  pair  (pm? Pout)?  we  repeat  the  experiment  25  times  and  calculate  an 
empirical  success  probability. 

Figure  5.1  charts  the  performance  of  S-reaper  for  recovering  a  one-dimensional  subspace  in 
ambient  dimension  D  =  3  and  D  =  100.  For  each  value  of  pin,  we  identify  the  minimum  empirical 
value  of  /9out  where  the  success  probability  is  50%,  and  we  use  least-squares  to  fit  a  linear  trend 
(with  pin  the  independent  variable).  This  experiment  shows  that  the  linear  relationship  between 
Pin  and  pout  is  valid  for  the  needle  problem.  The  theoretical  bound  in  Theorem  B.l  overestimates 
the  empirical  trend  by  approximately  a  factor  of  two. 

We  repeat  the  same  experiment  for  larger  values  of  the  subspace  dimension  d.  Figure  5.2  shows 
the  results  when  ( d,  D )  equals  (2,3)  or  (25, 100).  Once  again,  we  see  a  linear  relationship  between 
Pin  and  Pout-  Our  theoretical  bound  in  these  cases  is  less  satisfactory;  Theorem  B.l  overestimates 
the  trend  by  approximately  a  factor  of  eight. 


Sheet  in  a  haystack:  d=2,  D=3 
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Subspace  in  a  haystack:  d=25,  D=100 


Figure  5.2.  Exact  recovery  of  a  subspace  in  a  haystack  via  S-reaper.  The  data 
is  drawn  from  the  Haystack  Model  on  page  6  with  (d,  D )  =  (2,  3)  [left]  and  ( d ,  D)  = 
(25, 100)  [right].  The  red  circles  mark  the  empirical  50%  success  threshold  for  each 
value  of  pin,  and  the  yellow  lines  indicate  the  least-squares  fit.  The  trends  are 
Pin  =  1.52pout  +  0.70  [left]  and  p;n  =  1.66pout  +  3.14  [right].  See  Section  5.3  for 
details. 
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Figure  5.3.  Mean  angle  between  the  needle  and  the  computed  model.  The  data  is 
drawn  from  the  Haystack  Model  on  page  6  with  ( d,D )  =  (1, 100).  The  image  shows 
the  empirical  mean,  over  25  trials,  of  the  angle  between  the  true  subspace  and  a 
dominant  eigenvector  of  a  solution  to  the  S-Reaper  problem  (1.6).  See  Section  5.3.1 
for  details. 


5.3.1.  Graceful  Degradation  after  Recovery  Threshold.  The  phase  transition  plots  in  Figure  5.1 
and  5.2  are  misleading  because  S-REAPER  continues  to  work  well,  even  when  it  does  not  recover 
the  true  subspace  exactly.  A  different  view  of  the  data  from  the  phase  transition  experiments  in 
Section  5.3  captures  how  our  procedure  breaks  down.  Figure  5.3  shows  the  empirical  mean  angle 
between  the  true  subspace  L  and  the  dominant  eigenvector  of  the  solution  to  the  S-REAPER 
problem.  The  angle  increases  gradually  as  the  outlier  sampling  ratio  increases.  We  have  also  found 
that  the  solution  of  S-REAPER  degrades  gracefully  when  the  dimension  d  of  the  subspace  is  larger. 

5.4.  Faces  in  a  Crowd.  The  next  experiment  is  designed  to  test  how  well  several  robust  methods 
are  able  to  fit  a  linear  model  to  face  images  that  are  dispersed  in  a  collection  of  random  images. 
This  setup  allows  us  to  study  how  well  the  robust  model  generalizes  to  faces  we  have  not  seen. 

We  draw  64  images  of  a  single  face  under  different  illuminations  from  the  Extended  Yale  Face 
Database  [LHK05].  We  use  the  first  32  faces  for  the  sample,  and  we  reserve  the  other  32  for  the  out- 
of-sample  test.  Next,  we  add  400  additional  random  images  from  the  BACKGROUND/ Google  folder  of 
the  CaltechlOl  database  [Cal06,  FFFP04].  Each  image  is  converted  to  grayscale  and  downsampled 
to  20  x  20  pixels.  We  center  the  images  by  subtracting  the  Euclidean  median  (1.5).  Then  we 
apply  PCA,  spherical  PCA,  LLD,  reaper,  and  S-REAPER  to  fit  a  nine-dimensional  subspace  to  the 
data.  See  [BJ03]  for  justification  of  the  choice  d  =  9.  This  experiment  is  similar  to  work  reported 
in  [LLY+10,  Sec.  VI]. 

Figure  5.4  displays  several  images  from  the  sample  projected  onto  the  computed  nine-dimensional 
subspace  (with  the  centering  added  back  after  projection).  For  every  method,  the  projection  of 
an  in-sample  face  image  onto  the  subspace  is  recognizable  as  a  face.  Meanwhile,  the  out-of-sample 
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Original 
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S-REAP 


Figure  5.4.  Face  images  projected  onto  nine-dimensional  linear  model.  The 
dataset  consists  of  32  images  of  a  single  face  under  different  illuminations  and  400 
random  images  from  the  CaltechlOl  database.  The  original  images  [left  column] 
are  projected  onto  the  nine-dimensional  subspaces  computed  using  five  different 
modeling  techniques.  The  first  two  rows  indicate  how  well  the  models  explain  the 
in-sample  faces  versus  the  random  images.  The  last  two  rows  show  projections  of 
two  out-of-sample  faces,  which  were  not  used  to  compute  the  linear  models.  See 
Section  5.4  for  details. 

faces  are  described  poorly  by  the  PCA  subspace.  All  of  the  robust  subspaces  capture  the  facial 
features  better,  with  S-REAPER  producing  the  clearest  images. 

Figure  5.5  shows  the  ordered  distances  of  the  32  out-of-sample  faces  to  the  robust  linear  model 
as  a  function  of  the  ordered  distances  to  the  model  computed  with  PCA.  A  point  below  the  1:1  line 
means  that  the  zth  closest  point  is  closer  to  the  robust  model  than  the  itli  closest  point  is  to  the 
PCA  model.  Under  this  metric,  s-reaper  is  the  dominant  method,  which  explains  the  qualitative 
behavior  seen  in  Figure  5.4.  This  plot  clearly  demonstrates  that  S-reaper  computes  a  subspace 
that  generalizes  better  than  the  subspaces  obtained  with  the  other  robust  methods.  Indeed,  LLD 
performs  only  slightly  better  than  PCA. 
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Comparative  distance  of  out-of-sample  faces 
to  9-dimensional  subspace 


Distance  to  PCA  subspace 


Figure  5.5.  Approximation  of  out-of-sample  face  images  by  several  linear  models. 

The  ordered  distances  of  the  out-of-sample  face  images  to  each  robust  model  as  a 
function  of  the  ordered  distances  to  the  PCA  model.  The  model  was  computed 
from  32  in-sample  images;  this  graph  shows  how  the  model  generalizes  to  the  32 
out-of-sample  images.  Lower  is  better.  See  Section  5.4  for  details. 

5.5.  Low-Dimensional  Modeling  of  Textures.  Our  next  experiment  involves  an  image  of  a 
structured  texture  (a  pegboard  [SIP12,  File  1.5.02])  that  has  been  occluded  with  an  unstructured 
texture  (sand  [SIP12,  File  1.5.04]).  Shifts  of  a  structured  texture  can  be  explained  well  with  a  low¬ 
dimensional  model,  while  unstructured  textures  look  like  (anisotropic)  noise.  This  type  of  model 
might  be  interesting  for  inpainting  applications. 

See  Figure  5.6  for  the  images  we  use.  The  occlusion  covers  over  80%  of  the  original  image,  and  it 
is  scaled  so  that  the  original  image  is  faint  in  comparison  with  the  occlusion.  We  block  the  occluded 
image  into  24  x  24  patches,  and  we  vectorize  the  resulting  441  patches  to  obtain  observations  in 
a  576-dinrensional  space.  We  center  the  data  about  the  Euclidean  median  (1.5)  of  the  patches. 
We  compute  a  three-dimensional  linear  model  for  the  patches  in  the  occluded  image  using  PCA, 
spherical  PCA,  LLD,  reaper,  and  S-REAPER. 

Figure  5.7  shows  the  reconstruction  of  some  patches  from  the  original,  unoccluded  image.  It 
appears  that  S-REAPER  and  reaper  reproduce  the  structured  texture  better  than  PCA  or  spherical 
PCA.  Figure  5.8  shows  a  plot  of  the  ordered  distances  of  patches  in  the  original  image  to  the  robust 
linear  model  versus  the  ordered  distances  to  the  model  computed  by  PCA.  Note  that  the  models 
are  all  computed  from  the  occluded  image. 

The  plot  confirms  the  qualitative  behavior  seen  in  Figure  5.7:  the  regression  planes  determined 
by  reaper  and  s-reaper  capture  the  energy  of  the  patches  in  the  original  image  better  than 
PCA  and  spherical  PCA.  Here,  the  advantage  of  spherization  coupled  with  reaper  is  clear:  the 
performance  of  S-REAPER  is  significantly  better  than  either  spherical  PCA  or  REAPER. 
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Original  structured  image 
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Occluded  Image 


Figure  5.6.  Original  image  and  occluded  image.  The  original  [left]  and  occluded 
[right]  images  used  in  this  experiment.  The  yellow  gridlines  demarcate  the  bound¬ 
aries  of  the  24  x  24  patches. 


Figure  5.7.  Reconstruction  of  some  patches  from  the  original  texture.  The  left 
column  shows  three  patches  from  the  original  texture  that  are  completely  hidden  in 
the  occluded  image.  The  next  four  columns  show  the  projections  of  these  patches 
onto  the  three-dimensional  subspace  computed  from  the  occluded  image.  We  omit 
the  reconstruction  obtained  with  LLD  because  it  is  substantially  the  same  as  the 
result  with  PCA.  See  Section  5.5  for  details. 
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Distance  of  patches  in  original  image 
to  the  3-dimensional  linear  model 


Distance  to  PCA  subspace 


Figure  5.8.  Approximation  of  original  image  patches  by  several  linear  models. 

We  use  four  robust  methods  to  determine  a  three-dimensional  linear  model  for  the 
patches  in  the  occluded  image,  Figure  5.6.  The  plot  shows  the  ordered  distances  of 
the  patches  in  the  original  image  to  each  robust  models  as  a  function  of  the  ordered 
distances  to  the  PCA  model.  Lower  is  better.  See  Section  5.5  for  details. 

6.  Extensions  and  Modifications 

There  are  many  opportunities  for  research  on  robust  linear  models.  For  example,  it  would  also 
be  valuable  to  extend  the  analysis  in  this  work  to  more  sophisticated  data  models  and  to  consider 
more  realistic  experimental  setups.  There  are  also  many  variants  of  the  reaper  problem  that  may 
be  effective,  and  we  believe  that  our  formulations  could  be  extended  to  more  challenging  problems. 
In  this  section,  we  survey  some  of  the  possibilities  for  future  work. 

6.1.  Variations  on  the  reaper  Problem.  The  reaper  formulation  (1.4)  is  based  on  a  relaxation 
of  the  l\  orthogonal  distance  problem  (1.3).  First,  we  examine  what  happens  if  we  try  to  strengthen 
or  weaken  the  constraints  in  reaper.  Next,  we  discuss  some  alternatives  for  the  i\  objective 
function  that  we  have  applied  to  the  residuals  in  both  (1.3)  and  (1.4).  Third,  we  show  that  it  is 
possible  to  incorporate  a  centering  step  directly  into  the  optimization.  Last,  we  describe  several 
methods  for  constructing  a  linear  model  from  the  solution  to  the  REAPER  problem. 

6.1.1.  A  Stronger  Relaxation?  The  constraint  set  in  (1.3)  consists  of  all  trace-d  orthoprojectors, 
where  d  is  a  positive  integer.  In  the  reaper  problem,  we  relax  this  constraint  to  include  all  trace-d 
matrices  whose  eigenvalues  fall  in  [0, 1].  One  may  wonder  whether  a  tighter  relaxation  is  possible. 
If  we  restrict  our  attention  to  convex  sets,  the  answer  is  negative. 
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Fact  6.1.  For  each  integer  d  £  [0,  D\,  the  set  {P  £  M.DxD  :  0  ^  P  ^  I  and  tr  P  =  d}  is  the  convex 
hull  of  the  D  x  D  orthoprojectors  with  trace  d. 

To  prove  this  fact,  it  suffices  to  apply  a  diagonalization  argument  and  to  check  that  the  set 
{A  £  WD  :  i  \  =  d  and  0  <  Aj  <  1}  is  the  convex  hull  of  the  set  of  vectors  that  have  d  ones 

and  D  —  d  zeros.  See  [OW92]  for  a  discussion  of  this  result. 

6.1.2.  A  Weaker  Relaxation?  Another  question  is  whether  a  weaker  relaxation  of  reaper  might 
be  effective.  As  it  turns  out,  only  one  of  the  semidehnite  constraints  is  really  necessary.  Consider 
the  modified  optimization  problem 

minimize  ^  ||cc  —  Px\\  subject  to  0  ^  P  and  trP  =  ci.  (6-1) 

x&X 

This  problem  may  be  substantially  easier  to  solve  in  practice  with  general-purpose  software  such 
as  CVX  [GB08,  GB10].  The  modification  (6.1)  is  equivalent  to  (1.4)  in  the  following  sense. 

Theorem  6.2  (Weak  reaper).  The  problem  (6.1)  has  an  optimal  point  P+  that  satisfies  P*  ===!  I. 

We  provide  a  constructive  proof  of  Theorem  6.2  in  Appendix  D  that  converts  any  optimal  point 
of  (6.1)  into  an  optimal  point  that  satisfies  the  extra  semidehnite  constraint. 

6.1.3.  Deadzone  Formulation.  In  the  next  two  sections,  we  consider  some  ways  to  adapt  reaper 
that  preserve  convexity  and  yet  may  improve  robustness. 

First,  we  consider  a  modified  objective  function  that  ignores  small  residuals  entirely  so  that  the 
large  residuals  are  penalized  even  more  severely. 

minimize  nrax{0,  \\x  —  Px\\  —  5}  subject  to  0  ^  P  =<!  I  and  tv  P  =  d.  (6-2) 

x&X 

In  other  words,  no  point  contributes  to  the  objective  unless  the  model  explains  it  poorly.  The 
positive  parameter  5  controls  the  width  of  the  deadzone.  This  approach  is  similar  in  spirit  to  a 
soft-margin  classifier  [HTF09].  One  conceptual  disadvantage  of  the  deadzone  formulation  (6.2)  is 
that  it  probably  does  not  admit  an  exact  recovery  guarantee. 


6.1.4.  Outlier  Identification  and  Model  Fitting.  In  practice,  a  two-step  approach  can  lead  to  sub¬ 
stantially  better  linear  models  for  data.  First,  we  use  a  robust  method  to  find  a  subspace  that  fits 
the  data.  Next,  we  identify  and  discard  points  that  are  poorly  explained  by  the  model.  Finally,  we 
fit  a  subspace  to  the  reduced  data,  either  using  a  robust  method  or  a  classical  method.  See  [RL87] 
for  an  exposition  of  this  idea. 

We  can  amplify  this  approach  into  an  iterative  reweighted  t\  method  [CWB08,  KXAH10].  Set 
the  iteration  counter  k  <—  0,  and  set  f3x  £-  1  for  each  a;  £  A.  At  each  step,  we  solve  a  weighted 
version  of  the  reaper  problem: 

minimize  fix  \\x  —  Px ||  subject  to  0  P  I  and  tr  P  =  d.  (6-3) 


Then  we  update  the  weights  according  to  some  rule.  For  example, 


fix 


1 

max  {<5,  II*  —  *||  } 


for  each  *  £  X. 


By  canceling  off  the  magnitude  of  the  residual  in  (6.3),  we  hope  to  minimize  the  number  of  points 
that  are  poorly  explained  by  the  model.  We  repeat  this  procedure  until  we  do  not  see  any  additional 
improvement.  In  a  small  experiment,  we  found  that  a  single  reweighting  step  already  provided  most 
of  the  benefit. 
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6.1.5.  Incorporating  Centering  into  reaper.  In  Section  1.4.1,  we  discussed  the  importance  of  cen¬ 
tering  data  before  fitting  a  subspace.  In  this  paragraph,  we  consider  a  direct  method  for  finding  an 
affine  space  that  explains  the  data  instead  of  using  a  separate  centering  step.  There  is  a  natural 
convex  formulation  for  fitting  an  affine  model  to  data. 

minimize  ^  \\x  —  Px  —  c||  subject  to  0  P  =<:  I  and  tr  P  =  d.  (6-4) 

The  decision  variables  here  are  the  center  c  and  the  symmetric  matrix  P.  A  variant  of  Algorithm  4.2 
can  be  used  to  solve  this  problem  efficiently.  We  have  obtained  deterministic  recovery  guarantees 
for  (6.4)  when  the  inliers  are  contained  inside  an  affine  space.  For  brevity,  we  omit  the  details. 

The  main  shortcoming  of  (6.4)  is  that  we  do  not  see  a  way  to  spherize  the  centered  data  within 
the  framework  of  convex  optimization.  We  have  found  that  S-REAPER  is  often  more  powerful  than 
reaper,  so  we  believe  it  is  better  to  center  and  spherize  the  data  before  fitting  a  model. 

6.1.6.  Rounding  the  Solution.  For  real  data,  it  is  common  that  the  solution  P*  to  the  reaper 
problem  has  rank  that  exceeds  d.  In  our  numerical  work,  we  select  a  d-dimensional  model  by 
forming  the  span  of  the  d  dominant  eigenvectors  of  P+.  This  approach  is  effective  in  practice,  but 
it  hardly  seems  canonical.  It  is  likely  that  there  are  other  good  ways  to  achieve  the  same  end. 

Let  us  mention  three  alternative  ideas.  First,  we  might  take  the  entire  range  of  P*  as  our  model. 
It  may  have  more  dimensions  than  we  specified,  but  it  will  surely  provide  a  better  fit  than  the 
lower-dimensional  subspace. 

Second,  we  can  adjust  the  parameter  in  the  reaper  problem  until  the  solution  has  rank  d.  A 
quick  way  to  accomplish  this  is  to  apply  bisection  on  the  interval  [0,  d\.  This  method  also  works 
well  in  practice. 

Third,  we  can  randomly  round  the  matrix  P*  to  a  subspace  with  expected  dimension  d.  To  do 
so,  let  (A be  the  eigenpairs  of  P*.  Draw  Bernoulli  variables  8t  with  success  probability  A*, 
and  form  spanjuj  :  5,  =  1}.  This  approach  is  inspired  by  work  on  randomized  rounding  of  linear 
programming  solutions  to  combinatorial  problems. 

6.2.  Extensions  of  the  Analysis.  The  theoretical  analysis  in  this  paper  is  designed  to  provide  an 
explanation  for  the  numerical  experiments  that  we  have  conducted.  Nevertheless,  there  are  many 
ways  in  which  the  simple  results  here  could  be  improved. 

6.2.1.  Refining  the  Data  Models.  The  In  Sz  Out  Model  on  page  10  is  designed  to  capture  the  essential 
attributes  of  a  dataset  where  the  inliers  are  restricted  to  a  subspace  while  the  outliers  can  appear 
anywhere  in  the  ambient  space.  The  permeance  statistic  and  the  linear  structure  statistic  capture 
the  basic  properties  of  the  data  that  allow  us  to  establish  exact  recovery.  It  would  be  interesting 
to  identify  geometric  properties  of  the  data  that  allow  us  to  control  these  statistics. 

The  Haystack  Model  on  page  6  is  a  very  special  case  of  the  In  &;  Out  Model,  where  the  inliers  are 
isotropic  normal  variables  supported  on  a  subspace  and  the  outliers  are  isotropic  normal  variables 
on  the  ambient  space.  It  would  be  more  realistic  to  allow  anisotropic  data  distributions  and  to 
consider  extensions  beyond  the  Gaussian  case. 

Another  worthwhile  direction  is  to  study  the  performance  of  REAPER  and  S-REAPER  when  the 
inliers  are  contaminated  with  noise.  In  this  case,  we  cannot  hope  for  exact  recovery  guarantees, 
but  experiments  indicate  that  the  optimization  methods  are  still  effective  at  finding  linear  models 
that  are  comparable  with  the  performance  of  oracle  PCA  on  the  inliers. 

6.2.2.  Connections  with  Other  Algorithms.  Recht  [Recl2]  has  observed  that  the  IRLS  algorithm  is 
very  similar  to  an  EM  algorithm  for  the  normal  mixture  model.  There  are  also  evident  parallels  with 
the  multiplicative  weight  framework  [Kal07]  that  is  studied  by  the  online  algorithms  community. 
It  would  be  interesting  to  understand  whether  there  are  deeper  connections  among  these  methods. 
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6.3.  Harder  Problems.  A  very  large  number  of  applications  require  that  we  find  a  linear  model 
for  a  noisy  dataset.  It  is  also  common  that  we  need  to  fit  a  more  complicated  model  that  is  built 
out  of  simpler  linear  structures.  The  methods  in  this  work  are  also  relevant  in  these  situations. 

6.3.1.  Manifold  Learning.  A  common  component  in  methods  for  manifold  learning  is  a  local  PCA 
step  (see,  e.g.,  [LV07]).  The  idea  is  to  look  at  small  neighborhoods  in  the  data  and  to  fit  a  linear 
model  to  each  neighborhood.  The  algorithms  then  stitch  these  pieces  together  to  form  a  manifold. 
In  this  application,  it  is  common  for  the  data  to  contain  outliers  because  the  manifold  usually  has 
some  curvature  and  the  neighborhood  may  also  contain  points  from  different  parts  of  the  manifold. 
Robust  methods  for  linear  modeling  could  be  very  helpful  in  this  setting. 


6.3.2.  Hybrid  Linear  Modeling.  There  are  many  problems  where  the  data  lies  on  a  union  of  sub¬ 
spaces,  rather  than  a  single  subspace.  The  techniques  in  this  paper  are  very  relevant  for  this 
problem  because,  from  the  point  of  view  of  each  subspace,  the  observations  from  the  other  sub¬ 
spaces  are  outliers.  Suppose  that  A  is  a  set  of  observations,  and  we  would  like  to  fit  m  subspaces 
with  dimensions  di, . . . ,  dm .  The  following  optimization  offers  one  way  to  approach  this  problem. 

minimize  E  El  ||x  —  Pix\\  subject  to  0  ^  P;  ^  I,  tr  Pi  =  di,  and 


[Pi  P2 


<  1  +  a. 


(6.5) 


We  need  the  third  constraint  to  prevent  the  decision  variables  Pt  from  coalescing.  The  parameter 
a  >  0  reflects  how  much  overlap  among  the  subspaces  we  are  willing  to  tolerate. 

It  would  be  very  interesting  to  study  the  empirical  performance  of  this  approach  and  to  develop 
theory  that  explains  when  it  can  reliably  fit  hybrid  linear  models  to  data.  The  optimization 
problem  (6.5)  is  inspired  by  the  formulations  in  [ZSL09,  LZ11]. 


6.3.3.  Generalized  Linear  Models.  In  this  paper,  we  have  focused  on  a  robust  method  for  fitting  a 
standard  linear  model.  Generalized  linear  models  (GLMs)  also  play  an  important  role  in  statistical 
practice.  Unfortunately,  the  specific  methods  in  this  paper  do  not  apply  directly  to  GLMs.  To 
make  the  extension,  we  need  to  replace  the  squared  Euclidean  loss  with  a  Bregman  divergence 
that  is  derived  from  the  exponential  family  of  the  GLM  [BMDG05].  In  this  setting,  Bregman 
projectors  play  the  role  of  orthogonal  projectors.  It  would  be  interesting  to  develop  a  method  for 
parameterizing  the  Bregman  projector  onto  a  subspace  as  a  function  of  the  subspace.  Then,  we 
might  study  how  to  relax  this  formulation  to  obtain  a  convex  optimization  problem. 


6.4.  Conclusions.  We  have  found  that  the  reaper  and  S-REAPER  formulations  provide  a  rigorous 
and  effective  method  for  fitting  low-dimensional  linear  models  to  data.  In  future  work,  we  hope  to 
improve  our  theoretical  understanding  of  these  methods,  and  we  plan  to  extend  them  to  related 
problems.  We  hope  that  practitioners  will  try  out  these  techniques  to  enhance  data  analysis  in 
science  and  engineering  applications. 
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Appendix  A.  Exact  Recovery  Conditions  for  the  In  &  Out  Model 

In  this  appendix,  we  establish  Theorem  3.1,  which  provides  recovery  conditions  for  the  deter¬ 
ministic  In  &  Out  Model.  The  proof  is  an  easy  consequence  of  the  following  sharper  result,  which 
we  prove  in  Section  A.l  below. 

Theorem  A.l.  Let  L  be  a  d-dimensional  subspace  ofWD,  and  let  X  be  a  dataset  that  conforms 
to  the  In  &  Out  Model  on  page  10.  Define  a  matrix  Xout  whose  columns  are  the  points  in  Xout , 
arranged  in  fixed  order.  Suppose  that 


“f  Y  K«> 

ueL  Mv 
||u||  =  l  x^Xn 


X) 


>  V2d  •  max  j  (II^x  Xout)(IILx  -X^t)1  ,  (II^x  Aout)(IIiAout)t  |.  (A.l) 


Then  II ^  is  the  unique  minimizer  of  the  reaper  problem  (1.4). 

Proof  of  Theorem  3.1  from  Theorem  A.l.  First,  we  check  that  the  sufficient  condition  (3.5)  for 
reaper  follows  from  (A.l).  The  permeance  statistic  (3.1)  equals  the  left-hand  side  of  (A.l): 


V(L)  :=  inf  Y 

U&L  “ 
||u||=l 


X) 


We  can  bound  the  right-hand  side  of  (A.l)  using  the  linear  structure  statistics  (3.3)  and  (3.4). 


(n£xXout)(nLj.jrout)t 


< 


|Xout||  =S(L±)-S(Rd). 


I  n  lx  Xout  | 

The  inequality  follows  because  the  spectral  norm  is  submultiplicative  and  ||Il£±||  =  1.  Similarly, 

|(n^rx^t)(nLxout)t||  <  5(lx)  •  s(rd). 


Introduce  the  latter  three  displays  into  (A.l)  to  conclude  that  (3.5)  is  a  sufficient  condition  for  IIx, 
to  be  the  unique  minimizer  of  REAPER. 

To  obtain  the  sufficient  condition  for  S-REAPER,  we  need  to  apply  the  sufficient  condition  for 
reaper  to  the  spherized  data.  In  this  case,  the  scale-invariant  permeance  statistic  (3.2)  matches 
the  left-hand  side  of  (A.l): 

P(L)  =  inf  Y  *)l- 

On  the  right-hand  side  of  (A.l),  we  replace  each  instance  of  Xout  with  the  spherized  matrix  _Xout. 
To  complete  this  step,  observe  that  two  separate  normalizations  are  redundant: 


n^x-Xout  —  n^xAout. 


Therefore, 


Likewise, 


(nL±xoat)(nL±xoat)t 


< 


n  L±  Aout 


x, 


out 


=  S(LX)  -S(Rd). 


(nLxAout)(nLxout)t  <  5(lx)  •  s{rd) 


Introduce  the  latter  three  displays  into  (A.l)  to  conclude  that  (3.6)  is  a  sufficient  condition  for  II  £, 
to  be  the  unique  minimizer  of  S-REAPER.  □ 
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A.l.  Proof  of  Theorem  A.l.  Throughout  this  section,  we  use  the  notation  from  the  In  &  Out 
Model  on  page  10.  In  particular,  recall  that  L  is  a  d-dimensional  subspace  of  ,  and  let  X  = 
Xm  U  Tout  be  the  set  formed  from  a  set  Ajri  of  inliers  located  in  L  and  a  set  Tout  of  outliers  located 
in  Rd  \  L. 

The  proof  of  Theorem  A.l  is  based  on  a  perturbative  argument.  Observe  that  P  =  II ^  +  A  is 
feasible  for  the  reaper  problem  (1.4)  if  and  only  if  the  perturbation  A  satisfies 

0  ^  YYl  +  A  ^  I  and  tr  A  =  0.  (A. 2) 

We  demonstrate  that,  under  the  condition  (A.l),  the  objective  of  reaper  evaluated  at  11^  +  A  is 
strictly  larger  than  the  objective  evaluated  at  IIx,. 

To  establish  Theorem  A.l,  we  proceed  through  a  series  of  (very)  technical  results.  The  first 
lemma  identifies  a  variational  inequality  that  is  sufficient  for  the  overall  result  to  hold.  The  proof 
appears  below  in  Section  A. 2. 

Lemma  A. 2  (Sufficient  Conditions  for  Optimality).  Decompose  the  matrix  A  into  blocks: 

a  =  ii/,  An^  +  nLiAnL  +  aii^i  +  yil±a.yil±  (A.3) 

=;Ai  =:Ai  =  A*  =:A3 

Suppose  that,  for  all  A  f  0  that  satisfy  (A. 2),  we  have  the  conditions 


E 

*6  Yin 

IIAizH 

>  V2  J2  (As> 

®GA’out 

and 

(A.4) 

E 

*6  Yin 

11*2*11 

>  ^2  £  (a2, 

(A.5) 

Then  IT^  is  the  unique  minimizer  of  the  reaper  problem  (1.4). 

We  must  check  that  the  two  conditions  in  Lemma  A. 2  hold  under  the  hypothesis  (A.l)  of  Theo¬ 
rem  A.l.  The  proof  of  this  result  appears  below  in  Section  A. 5. 


Lemma  A. 3  (Checking  the  Conditions).  The  inequality  (A. 4)  holds  when 

4=-  inf  £  \(u,  x)\>V2  (n^t)(nLiXout Y 

The  inequality  (A. 5)  holds  when 

4=-  inf  V  |(u,  x)\>V2  (nfTx()7t)(nLA0ut)t 
Vd  nUn  1  xcX- 

We  may  now  complete  the  proof  of  Theorem  A.l. 


(A.6) 


(A.7) 


Proof  of  Theorem  A.l  from  Lemmas  A. 2  and  A. 3.  The  hypothesis  (A.l)  implies  the  two  conditions 
in  Lemma  A. 3.  Therefore,  the  two  conditions  in  Lemma  A. 2  are  in  force,  which  means  that  11^  is 
the  unique  minimizer  of  REAPER.  □ 


A. 2.  Proof  of  Lemma  A. 2.  The  reaper  problem  (1.4)  is  convex,  so  it  suffices  to  check  that  the 
objective  increases  for  every  sufficiently  small  nonzero  perturbation  A  that  satisfies  the  feasibility 
condition  (A. 2).  The  difference  between  the  perturbed  and  the  unperturbed  objective  values  can 
be  expressed  as 

V(A):=  [ll(nL±  -  A)® ||  -  ||nL±as||]  (A.8) 

because  IIl.l  =  I  —  II^.  Our  goal  is  to  show  that  r/( A)  >  0  whenever  A  is  small  and  feasible. 
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We  split  the  expression  for  77(A)  into  a  sum  over  inliers  and  a  sum  over  outliers.  Each  inlier 
x  £  Xm  falls  in  the  subspace  L,  so  that  =  0.  Therefore, 


(nL±  —  A)*||  —  ||nL±aj||  =  ||  Asc ||  for  x  £  Xm. 


(A.9) 


To  evaluate  the  summands  in  (A. 8)  where  x  £  Xout .  recall  that  each  outlier  has  a  nontrivial 
projection  on  the  subspace  L^~ .  When  A  — >  0,  we  compute  that 


(nLx  -  A)*||  =  f||nLx®||2  -  2  {Ax,  uL±x)  + 1| a* 

2 


21 1/2 


=  l|nL±a; 


1  - 


\nL±x 


Ax,  UL±x)  +  Oi  || A 


1/2 


=  ||IILxa;||  —  ^ Ax,  HL±x^  +  0{  || A|| 2  )  for  *  £  Tfout- 
The  last  identity  follows  from  the  Taylor  development  of  the  square  root.  In  summary,  as  A  — >  0, 


(nLx  -  A)®||  -  ||nLx*||  =  -  {Ax,  nLxx)  +  of  || a 


for  x  €  X, 


out* 


(A.10) 


Introduce  the  inequality  (A.9)  for  the  inlier  terms  and  the  inequality  (A.  10)  for  the  outlier  terms 
into  the  definition  (A. 8)  to  obtain 

77(A)  =  y  ||A®||  —  y  (Ax,  JJL±x^  +  0(  || A 1 1 2  )  when  A  — *  0. 

X(!z?£ in  XdzPd, out 

Therefore,  77(A)  >  0  for  all  nonzero  feasible  perturbations  provided  that 

y  ||Aa:||  >  y  ^ Ax ,  Il^xaj^  for  each  A/0  that  satisfies  (A. 2).  (A. 11) 

X{zi\?out 

In  fact,  the  same  argument  demonstrates  that  (A. 11)  becomes  a  necessary  condition  for  11^  to 
minimize  (1.4)  as  soon  as  we  replace  the  strong  inequality  by  a  weak  inequality. 

To  continue,  we  use  the  block  decomposition  (A. 3)  to  break  the  sufficient  condition  (A.  11)  into 
two  pieces  that  we  can  check  separately.  First,  note  that  IILxA  =  A2II 1  +  A3lILx,  so  the 
right-hand  side  of  (A.  11)  splits  up  as 

J2  (ax,  nL±x)  =  y  (aoUlx,  nL±x)  +  y  ^A3nLxa:,  nLxcc) 

x£Xout  x€Xout  x£.Xowt 

=  y  (a2,  (ri 2Tx)(nLx)t)+  y  (a3,  {fhTx^n^xY) .  (A.12) 

x£Xout  x£Xout 

The  second  identity  holds  because  ( Ab ,  c)  =  (A,  cb t).  Next,  recall  that  each  inlier  falls  in  L ,  and 
use  orthogonality  to  check  that 

||  A* || 2  =  || (IIi,  +  nLx) AIl£2;||2  =  ||  Aia?||2  +  ||  A2*||2  for  x  £  Ajn. 

Therefore,  we  can  bound  the  left-hand  side  of  (A.ll)  below. 

y  ||Ax||  =  y  [|| Ai£e||2  +  |j A2£c||2]  >  -^=  y  [  f| Aiasll  -h  || A2a5||  ]  (A. 13) 

x£Xin  x£X-In  *  x£XIn 

Together,  the  assumptions  (A. 4)  and  (A. 5)  imply  that  the  right-hand  side  of  (A.  13)  exceeds  the 
right-hand  side  of  (A.12).  Therefore,  the  perturbation  bound  (A.ll)  holds. 
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A. 3.  Properties  of  a  Feasible  Perturbation.  To  simplify  the  conditions  in  Lemma  A. 2,  we 
need  to  collect  some  information  about  the  properties  of  a  feasible  perturbation. 

Sublemma  A. 4.  Suppose  that  A  satisfies  the  condition  (A. 2)  for  a  feasible  perturbation.  Then 
the  following  properties  hold. 

(i)  tr  Ai  +tr  A3  =  0, 

(ii)  Ai  ===:  0,  and 
(hi)  A 3  ^  0. 

In  addition,  assume  that  A2  0.  Then 

(iv)  Ai  0, 

(v)  A 3  /  0,  and 

(vi)  tr  A3  =  —  tr  Ai  >  0. 

In  particular,  Ai  =  0  or  A  3  =  0  implies  that  A2  =  0. 

Proof.  Assume  that  A  satisfies  the  condition  (A. 2)  for  a  feasible  perturbation.  To  prove  Claim  (i), 
apply  the  cyclicity  of  the  trace  to  see  that 

tr  A2  =  tr(nL±AIIx,)  =  tr((nLnL_L)A)  =  tr(0)  =  0. 

Take  the  trace  of  the  block  decomposition  (A. 3)  to  verify  that  tr  Ai  +  tr  A3  =  tr  A  =  0. 

To  check  (ii),  conjugate  the  relation  0  ^  +  A  by  the  orthoprojector  II^x  to  see  that 

0  A  nLx(nL  +  A)nLx  =  a3. 

Similarly,  Claim  (iii)  follows  when  we  conjugate  II l  +  A  ^  I  by  the  orthoprojector  11^. 

We  verify  (iv)  and  (v)  by  establishing  the  contrapositive.  If  Ai  =  0  or  A3  =  0,  then  A2  =  0. 
First,  we  argue  that  Ai  and  A3  are  zero  together.  Indeed,  assume  that  Ai  =  0.  Claim  (i)  implies 
that  tr(A3)  =  0.  But  then  (iii)  forces  A3  =  0.  Likewise,  A3  =  0  implies  that  Ai  =  0. 

Now,  suppose  Ai  =  A3  =  0.  According  to  the  Schur  complement  lemma  [And05,  Thm.  5.3], 

0^nL  +  A  ^  A^O  and  A3  ^  A2(IIL  +  Ai)f  A*, 

where  the  dagger  t  denotes  the  Moore-Penrose  pseudoinverse.  Since  A  is  feasible,  the  left-hand 
side  of  the  implication  is  true,  and  so 

a3  >  A2n[Al  =  A2nLAl2  =  A2Al2. 

Therefore,  A2  =  0  because  A3  =  0. 

Finally,  Claim  (vi)  holds  because  (iii)  and  (v)  together  imply  that  tr(A3)  >  0.  Invoke  (i)  to 
complete  the  proof.  □ 


A. 4.  Lower  Bounds  for  the  Sum  over  Inliers.  The  next  result  allows  us  to  obtain  lower  bounds 
for  the  left-hand  sides  of  the  conditions  in  Lemma  A. 2.  The  statement  involves  the  Schatten  1-norm 
ll'llsj,  which  returns  the  sum  of  the  singular  values  of  a  matrix  [Bha97,  p.  92]. 

Sublemma  A. 5.  The  following  relations  hold  for  every  matrix  A. 


inf 

nL±AnL 


E  ||nL±AIILa;||  > 

x&Xln 


inf 

tr(IIx  Ant)=l 

niin^o 


E  l|nLAnLa; 

x&Xiu 


1 

Vd 


inf 

u^L 

1*41=1 


E  Ku>  *)l- 

ajGAdn 


(A.14) 
(A- 15) 


Proof.  To  verify  the  relation  (A.14),  we  show  that  each  feasible  point  A  for  the  left-hand  infimum 
can  be  converted  into  a  feasible  point  A  for  the  right-hand  infimum  without  changing  the  objective 
value.  Let  A  be  an  arbitrary  matrix,  and  fix  a  singular  value  decomposition  nLxAIl£  =  UYIV1. 
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Define  the  positive  semidefinite  matrix  A  =  DSy*,  and  observe  that  A  =  II^An^  because 
S’VAtIIi  =  EF1.  First, the  unitary  invariance  of  the  trace  shows  that 

||nL±AnL||5i  =  trS  =  tr  A  =  t^n^AIIf,). 

Next,  let  x  be  an  arbitrary  vector,  and  calculate  that 

\\Ul±AUlx\\  =  ||El'VAta;||  =  ||Ax||  =  ||n£,AIl£,*|| . 

Therefore,  inequality  (A.  14)  is  valid. 

The  second  relation  (A.  15)  is  more  difficult.  The  basic  idea  is  to  replace  the  trace  constraint 
in  the  left-hand  infimum  with  a  Frobenius-norm  constraint.  Subject  to  this  new  constraint,  the 
infimum  has  a  simpler  form.  First,  factor  out  the  Frobenius  norm  ||IIrAIIr|L  from  the  left-hand 
side  of  (A.  15): 


inf 

tr(nL  aiil)=i 

nLAn^o 


£  ||nLAnLx 

x£Xin 


>  inf  ||nLAnL|L  x  inf  £  Iln^AIIiX 
tr(n^,  An^)=i  L  L"F  ||nLAnL||F=i 

nLAriz>o 


(A.16) 


To  compute  the  first  infimum  on  the  right-hand  side  of  (A.16),  note  that  the  rank  of  II^AIIx, 
cannot  exceed  d  because  L  is  d-dimensional  subspace.  Therefore,  by  diagonalization, 

inf  ||nLAnL||F  =  inf  {||A||  :  ||AL  =1,  Ae<}  =  4v  (A.17) 

tr(nI/Ani)=i  1  11  +J  y/d 

nLAnL^o 


To  compute  the  second  infimum,  we  use  a  more  careful  diagonalization  argument.  We  can  pa¬ 
rameterize  the  feasible  set  using  the  spectral  decomposition  Il^AIIx,  =  J2i=i  uiub  where  the 
eigenvalues  A j  are  nonnegative  and  the  eigenvectors  Ui  form  an  orthobasis  for  L.  Thus, 


inf 

||nLAnL||p= 


£  ||nLAnLx| 


inf  inf  £  £d  Xf  \  { 

orthobases  {uj}  for  L  ||A|j  =  l  TTl  ^ 


1  1/2 


Ui,  X) 


xexin  ^  ~  xexh 

This  expression  can  be  simplified  dramatically.  Make  the  identification  pi  =  A?  for  each  i  =  1, . . . ,  d. 
Observe  that  the  objective  on  the  right-hand  side  is  a  concave  function  of  the  vector  p.  Furthermore, 
p  is  constrained  to  the  convex  set  of  nonnegative  vectors  that  sum  to  one  (i.e.,  probability  densities). 
The  minimum  of  a  concave  function  on  a  compact  set  is  achieved  at  an  extreme  point,  so  the  infimum 
occurs  when  p  is  a  standard  basis  vector.  Therefore, 

nD/2 


inf  £  £  A? 

II All =i  L^*=i 


( Uj ,  x) 


fcGAfj 


=  mm 


E 

x&XiYI 


{Ui,  X) 


Combining  the  last  two  relations,  we  obtain 


inf 

||nLAnL||F= 


£  ||nLAnLx||  =  inf  £  |(«,  x)\. 

1  M  =  !  X^xin 


(A.18) 


X£X:r 


To  complete  the  proof  of  (A. 15),  we  introduce  (A. 17)  and  (A.18)  into  (A.16).  □ 

A. 5.  Proof  of  Lemma  A. 3.  We  are  now  prepared  to  check  that  the  conditions  in  Lemma  A. 3 
imply  the  conditions  in  Lemma  A. 2.  Throughout  the  argument,  we  assume  that  A  is  a  nonzero 
perturbation  that  satisfies  the  feasibility  condition  (A. 2),  so  we  have  access  to  the  results  in  Sub- 
lennna  A. 4. 

First,  observe  that  the  inequality  (A. 4)  is  homogeneous  in  A.  In  view  of  Sublemma  A. 4,  we 
can  scale  A  so  that  tr(As)  =  —  tr(Ai)  =  1.  Furthermore,  we  may  assume  that  Ai  is  negative 
semidefinite  and  A3  is  positive  semidefinite.  Therefore,  it  suffices  to  check  the  constrained  infimum 
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of  the  left-hand  side  of  (A. 4)  exceeds  the  constrained  supremum  of  the  right-hand  side.  It  follows 
immediately  from  Sublemma  A. 5  that 


inf 

trAl=— 1 
Ai=<0 


1 


inf 


x) 


reYin 

Meanwhile,  the  duality  between  the  the  Schatten  1-nornr  and  the  spectral  norm  gives 


Y  j|Aix||  >  -= 


V2  sup 
tr  A;J=1 

A3^0 


E  (a3>  (n^xn^aO1)  =  V2 

A’out 


E  (uL±x)(uL±xy 

ut 


=  V2 


(niiiollt)(niiXout)t 


The  hrst  equality  holds  because  the  matrix  inside  the  spectral  norm  happens  to  be  positive  semi- 
definite.  For  the  second  identity,  we  simply  identify  the  sum  as  the  product  of  the  two  matrices. 
Combine  the  last  two  displays  to  see  that  the  condition  (A. 4)  follows  from  (A. 6). 

The  second  part  of  the  proof  is  similar.  Observe  that  the  inequality  (A. 5)  is  homogeneous  in 
A2,  so  we  may  assume  that  HA2II5  =  1.  Sublemma  A. 5  delivers  the  bound 


inf 

Hulls'!  =1 


E  llAixll  ^ 

x&Xin 


1 

Vcl 


inf 

u£L 

1x11=1 


E  Kw’  *)l- 

tcSYin 


The  duality  argument  we  used  for  the  hrst  condition  now  delivers 


\/2  sup  E  (A2,  (U^x^UlxY) 
liA2llSl=1  xexout 


V2 


(nLxXout)(nLAout)t 


Combine  these  two  displays  to  see  that  (A. 5)  is  a  consequence  of  (A. 7). 


Appendix  B.  Exact  Recovery  Conditions  for  the  Haystack  Model 

In  this  appendix,  we  establish  exact  recovery  conditions  for  the  Haystack  Model.  To  accomplish 
this  goal,  we  study  the  probabilistic  behavior  of  the  (spherical)  permeance  statistic  and  the  (spher¬ 
ical)  linear  structure  statistic.  Our  main  result  for  the  Haystack  Model,  Theorem  B.l,  follows  when 
we  introduce  the  probability  bounds  into  the  deterministic  recovery  result,  Theorem  3.1.  Theo¬ 
rem  1.1,  the  simplified  result  for  the  Haystack  Model,  is  a  consequence  of  the  more  detailed  theory 
we  develop  here. 

Theorem  B.l.  Fix  a  parameter  c  >  0.  Let  L  be  an  arbitrary  d-dimensional  subspace  o/MD,  and 
draw  the  dataset  X  at  random  according  to  the  Haystack  Model  on  page  6. 

(i)  Let  1  <  d  <  D  —  1.  Suppose  that 

/|pta-(2  +  c)^>^-/Ep^'(^+1  +  C\/I)  '  (B'l8) 

Then  FR  is  the  unique  minimum  of  reaper  (1.4),  except  with  probability  3.5  e  c2rf/2. 

(ii)  Let  2  <  d  <  D  —  1.  Suppose  that 

f-P-m  -  (2  +  cV2)^  >  -  ^  -d-  0.5  •  (^  +  1  +  C/l)  •  (Rlb) 

Then  FR  is  the  unique  minimum  o/s-REAPER  (1.6),  except  with  probability  4e  c2d/2. 
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(iii)  Let  d  =  1  and  D  >  2.  Suppose  that 

(v^+!  +  ^)  ■  (B'lc) 

Then  II ^  is  £/ie  unique  minimum  o/s-REAPER  (1.6),  except  with  probability  3e  c~/2. 

We  prove  these  rather  daunting  expressions  below  in  Sections  B.l  and  B.2.  First,  let  us  examine 
the  content  of  Theorem  B.l  in  a  specific  regime.  Suppose  that 

D  — >  oo,  d  — *  oo,  and  d/D—t  0. 

Then  the  condition  (B.lb)  simplifies  to 

Pin  ^  \/t  '  (  \/ Pout  T  1 )  • 

This  inequality  matches  our  heuristic  (3.7)  exactly!  Now,  we  demonstrate  that  Theorem  B.l  con¬ 
tains  the  simplified  result  for  the  Haystack  model,  Theorem  1.1. 


Proof  of  Theorem  1.1  from  Theorem  B.l.  To  begin,  we  collect  some  basic  inequalities.  For  a  >  0, 
the  function  f(x)  =  x  —  a^/x  is  convex  when  x  >  0,  so  that 


x  —  a\J~x  =  f(x)  >  /(a2)  +  f'(a2)(x  —  a2)  = 


For  1  <  d  <  (D  —  1  )/2,  we  have  the  numerical  bounds 

D  D 


D-d-  0.5 


<2, 


D  -  0.5 


<  1.2,  and 


Finally,  recall  that  (a  +  b  +  c)2  <  3(a2  +  b2  +  c2)  as  a  consequence  of  Holder’s  inequality. 

To  prove  the  claim  (1.7)  of  Theorem  1.1,  we  apply  our  numerical  inequalities  and  the  fact 
(2  +  c)2  <  2(4  +  c2)  to  weaken  (B.la)  to  the  condition 


Pin  ~  7t(4  +  (?)  >  12  •  (pout  +  1  +  2c2) 

V  2  <T[n 

Apply  Theorem  B.l(i)  with  c  =  \/2f3  to  obtain  (1.7)  with  constants  Ci  =  47t  <  13  and  C2  =  2n  <  7 
and  C3  =  12\J-k/2  <  16. 

To  prove  the  claim  (1.8)  of  Theorem  1.1,  we  use  the  same  reasoning  to  weaken  (B.lb)  to  the 
condition 


Pin  ~  7r(4  +  2c2)  >  12 


3-*-{Pout  +  1  +  2c2^ 
5 


Apply  Theorem  B.l(ii)  with  c  =  y/2/3  to  obtain  (1.8)  in  the  range  2  <  d  <  (D  —  l)/2.  The  case 
d  =  1  follows  from  a  similar  argument  using  Theorem  B.l  (iii).  □ 


B.l.  Probability  Inequalities  for  the  Summary  Statistics.  The  proof  of  Theorem  B.l  re¬ 
quires  probability  bounds  on  the  permeance  statistic  V ,  the  linear  structure  statistic  S,  and  their 
spherical  counterparts  V  and  S.  These  bounds  follow  from  tail  inequalities  for  Gaussian  and  spher¬ 
ically  distributed  random  vectors  that  we  develop  in  the  next  two  subsections. 


B.1.1.  The  Permeance  Statistic.  Our  first  result  is  used  to  provide  a  high-probability  lower  bound 
for  the  permeance  statistic  V(L)  under  the  Haystack  Model. 


Lemma  B.2.  Suppose  g  1, . . .  ,gn  are  i.i.d.  normal(0,  I)  vectors  in  Mrf.  For  each  t  >  0, 


except  with  probability  e  !2 . 


(B.2) 
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Proof.  Add  and  subtract  the  mean  from  each  summand  on  the  left-hand  side  of  (B.2)  to  obtain 


The  second  sum  on  the  right-hand  side  has  a  closed  form  expression  because  each  term  is  the 
expectation  of  a  half-Gaussian  random  variable:  E  |  (u,  gf)  \  =  \/2/tt  for  every  unit  vector  u. 
Therefore, 


(B.4) 


To  control  the  first  sum  on  the  right-hand  side  of  (B.3),  we  use  a  standard  argument.  To  bound 
the  mean,  we  symmetrize  the  sum  and  invoke  a  comparison  theorem.  To  control  the  probability  of 
a  large  deviation,  we  apply  a  measure  concentration  argument. 

To  proceed  with  the  calculation  of  the  mean,  we  use  the  Rademacher  symmetrization 
lemma  [LT91,  Lem.  6.3]  to  obtain 


E 


sup 

■u!j=l 


n 

E 

i— 1 


n 


[( E  |  <tt,  gi)  \ )  —  |  (u,  gf)  |]  <2Ej3up^  ](«,  gf)\ 

i=  1 


U  =1 


The  random  variables  £\, . .  ,  ,en  are  i.i.d.  Rademacher  random  variables  that  are  independent  from 
the  Gaussian  sequence.  Next,  invoke  the  Rademacher  comparison  theorem  [LT91,  Eqn.  (4.20)]  with 
the  function  (p(- )  =  ]•]  to  obtain  the  further  bound 


n  n 

E  sup  [(E|(u,  -  |(«,  gi)\]  <  2E  sup  Y£i(U’  Si)  =  2E\\j2'‘=l£igi  ■ 

IMI=1  i=\  IMI=1  i= i 

The  identity  follows  when  we  draw  the  sum  into  the  inner  product  a  maximize  over  all  unit  vectors. 
From  here,  the  rest  of  the  argument  is  very  easy.  Use  Jensen’s  inequality  to  bound  the  expectation 
by  the  root-mean-square,  which  has  a  closed  form: 


E  sup  E  [(eKm>  9i) I )  -  gi)\]  < 2  E  E-1 

1 1 » .  II -I  —  LV  7  J  1 


IMI=1 


2—1 


1/2 


=  2  Vnd. 


(B.5) 


Note  that  the  mean  fluctuation  (B.5)  is  dominated  by  the  centering  term  (B.4)  when  n>i 
To  control  the  probability  that  the  fluctuation  term  is  large,  we  use  a  standard  concentration 
inequality  [Bog98,  Theorem  1.7.6]  for  a  Lipschitz  function  of  independent  Gaussian  variables.  Define 
a  real- valued  function  on  dxn  matrices:  f(Z)  =  sup||u||=1  Ya= i(\/2/7r— |(wj  zi)\)->  where  Zj  denotes 
the  All  column  of  Z .  Compute  that 

n  n 

\f(Z)-f(Z')\<  sup  £  l<«>  -*<>[<  E  N  -z'i\\<V^\\Z-  z'Hf  . 

i=  1  i=  1 


Therefore,  /  has  Lipschitz  constant  ^Jn  with  respect  to  the  Frobenius  norm.  In  view  of  the  esti¬ 
mate  (B.5)  for  the  mean,  the  Gaussian  concentration  bound  implies  that 

pj^sup  [(E|(w,  gi)\)  - \(u,  flj)|]  >  2Vnd  +  t^/n)j  <  e~t2/2.  (B.6) 

Introduce  the  bound  (B.6)  and  the  identity  (B.4)  into  (B.3)  to  complete  the  proof.  □ 


To  control  the  spherical  permeance  statistic,  we  need  a  variant  of  Lemma  B.2.  The  following 
observations  play  a  critical  role  in  this  argument  and  elsewhere.  Suppose  that  g  is  uniformly 
distributed  on  the  unit  sphere  in  Mrf,  and  let  r  be  a  x-distributed  random  variable  with  d  degrees 
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of  freedom  that  is  independent  from  g.  Then  r  g  ~  g,  where  g  is  a  standard  normal  vector.  To 
apply  this  fact,  we  need  some  bounds  for  the  mean  of  a  //-distributed  variable: 

Vd  —  0.5  <  E  r  <  Vd  when  r  ~  Xd-  (B.7) 

The  upper  bound  follows  from  Jensen’s  inequality,  while  the  lower  bound  is  a  refinement  of 
Gautschi’s  inequality  due  to  Kershaw  [Ker83]. 

Lemma  B.3.  Suppose  gq, . . . ,  gn  are  i.i.d.  random  vectors  uniformly  distributed  on  the  unit  sphere 
Sd_1  in  Rd.  When  d  =  1, 

n 

inf  ^2\(u,  gi)\  =  n.  (B.8) 

\\U  =1 

11  11  1=1 

Assume  that  d  >  2.  For  all  t  >  0, 

nilfi.gK*'.  *>i>Vf'^-2  <BJ> 

except  with  probability  e-*2/2. 


Proof.  When  d  =  1,  the  argument  is  particularly  simple.  A  random  variable  uniformly  distributed 
on  the  0-dimensional  sphere  has  a  Rademacher  distribution.  The  result  (B.8)  is  immediate. 

Assume  that  d  >  2.  The  proof  is  analogous  to  the  argument  for  the  Gaussian  case,  Lemma  B.2. 
Add  and  subtract  the  mean  from  the  left-hand  side  of  (B.9)  to  reach 


9i)  I  > 


n 

inf  [Kw’  Si)\  9i 

IX  =1 
"  1=1 


9i)  I- 


(B.10) 


To  bound  the  second  term  on  right-hand  side,  we  introduce  independent  //-variates  ri  to  facilitate 
comparison  with  a  Gaussian.  Indeed, 


El<“.  Si) I  =  ~  |(u,  ngf) |  >  -E|(u, 


The  last  inequality  follows  from  (B.7).  Therefore, 


l~2  n 


(B.ll) 


Next,  we  bound  the  mean  of  the  first  term  on  the  right-hand  of  (B.10).  By  repeating  the 
symmetrization  and  comparison  argument  from  Lemma  B.2,  we  reach 


E  sup  [(EKW,  9i)\)  -  |<«, 


<  2 


IMI=i 


i=\ 


E 


ETL 

i= 1 


£i  9i 


1/2 


=  2  y/n. 


The  identity  depends  on  the  fact  that  each  vector  gi  has  unit  norm. 

To  control  the  probability  of  a  large  deviation,  we  need  an  appropriate  concentration  inequality 
for  the  function  /  defined  on  page  34.  We  apply  a  result  for  a  Lipschitz  function  on  a  product 
of  spheres  [LedOl,  Thm.  2.4],  with  the  parameter  c  =  d  —  1,  along  with  the  definition  [LedOl, 
Prop.  1.11]  of  the  concentration  function  to  reach 

P|||S||P1  iZ[(EKw’  9i)\)-\(u,  Qi) l]  >  2Vn  +  t\J-^—^  <  e-<-/2.  (B.12) 

Introduce  (B.12)  and  (B.ll)  into  (B.10)  to  complete  the  proof  of  (B.9).  □ 
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B.1.2.  Probability  Bounds  for  the  Linear  Structure  Statistics.  To  control  the  linear  structure  statis¬ 
tic  S ,  we  need  a  tail  bound  for  the  maximum  singular  value  of  a  Gaussian  matrix.  The  following 
inequality  is  a  well-known  consequence  of  Slepian’s  lemma.  See  [DS01,  Thm.  2.13]  and  the  er¬ 
rata  [DS03]  for  details. 


Proposition  B.4.  Let  G  be  an  rn  x  n  matrix  whose  entries  are  i.i.d.  standard  normal  random 
variables.  For  each  t  >  0, 

P  {||G||  >  s/m  +  \fn  +  i}  <  1  —  <3 >(t)  <  e”^2, 
where  <3>(i)  is  the  Gaussian  cumulative  density  function 

1  r* 


m  := 


s/2tt  J-c 


~r  G  dr. 


To  bound  the  spherical  linear  structure  statistic  S,  we  need  a  related  result  for  random  matrices 
with  independent  columns  that  are  uniformly  distributed  on  the  sphere.  The  argument  bootstraps 
from  Proposition  B.4. 


Lemma  B.5.  Let  S  be  an  rn  x  n  matrix  whose  columns  are  i.i.d.  random  vectors,  uniformly 
distributed  on  the  sphere  in  Mm.  For  each  t  >  0, 

p(  ys+ym  +  tl<l5e-tv2  (B.13) 

{  ~  y/m  -  0.5  j  “ 

Proof.  Fix  6  >  0.  The  Laplace  transform  method  shows  that 

<?|i  >  Vn  +  y/rn  +  t\  <  ^_0{v^+v^+t)  .  w  e(?>/s=o3||S|| 

“  y/m  -  0.5  J  “ 

We  compare  ]|S,||  with  the  norm  of  a  Gaussian  matrix  by  introducing  a  diagonal  matrix  of  y- 
distributed  variables.  The  rest  of  the  argument  is  purely  technical. 

Let  r  =  (rq, . . . ,  rn)  be  a  vector  of  i.i.d.  y-distributed  random  variables  with  m  degrees  of  freedom. 
Recall  that  r%  gi  ~  gL ,  where  g,  is  uniform  on  the  sphere  and  gi  is  standard  normal.  Using  the 
bound  (B.7),  which  states  that  \Jm  —  0.5  <  Em,  and  Jensen’s  inequality,  we  obtain 

Ee<VrrU05||S||  <  jg  diag(r)S||  <  ]E  e^ll<^ll , 


where  G  is  an  m  x  n  matrix  with  i.i.d.  standard  normal  entries. 

Define  a  random  variable  Z  :=  ||G||  —  s/n  —  \fm,  and  let  Z+  :=  max{Z,  0}  denote  its  positive 
part.  Then 


eet  ■  P  <  Eeez  <  Eeez+  =  1  + 


J)t 


e” '  •  P{J^+  >  t}  dr. 

Apply  the  cdf  bound  in  Proposition  B.4,  and  identify  the  complementary  error  function  erfc. 


/  o 


eet  ■  P  <  1  +  - 


/  o 


e6r  ■  erfc  (  ~^=  )  dr, 


s/V 


A  computer  algebra  system  will  report  that  this  frightening  integral  has  a  closed  form: 

ej  e°T  ■  erfc  ^ r  =  (erf(0)  +  1)  —  1  <  2ee“//2  —  1. 

We  have  used  the  simple  bound  erf(0)  <  1  for  9  >  0.  In  summary, 


P  <  e 


-et 


1 

L2+e' 


e»2/2 


Select  6  =  t  to  obtain  the  advertised  bound  (B.13). 


□ 
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B.2.  Proof  of  Theorem  B.l.  Let  Xout  be  a  D  x  Nou t  matrix  whose  columns  are  the  outliers 
x  £  Tout,  arranged  in  fixed  order.  Recall  that  the  inlier  sampling  ratio  p\n  :=  N\n/d  and  the  outlier 
sampling  ratio  pOVLt  :=  Nout/D. 

First,  we  establish  Claim  (i).  The  N[n  inliers  are  drawn  from  a  centered  Gaussian  distribution 
on  the  d-dimensional  space  L  with  covariance  (cr2n/d)  1^.  Rotational  invariance  and  Lemma  B.2, 
with  t  =  cy/d,  together  imply  that  the  permeance  statistic  (3.1)  satisfies 


V(L)> 


°"in 


(2  +  c)  \J  N[nd 


—  Cm 


y/~d 


except  with  probability  e~c~d^2.  The  Nout  outliers  are  independent,  centered  Gaussian  vectors 
in  WD  with  covariance  (a2ut/ D)I.  Proposition  B.4,  with  t  =  cy/d,  delivers  a  bound  for  the  linear 
structure  statistic  (3.3): 

S(Rd)  =  ll-Xoutll  <  [v/iVcun  +  Vd  +  cVd]  =  (Tout 


V  Pout  +  1  +  c\j  — 


except  with  probability  e~c2rf//2.  Rotational  invariance  implies  that  the  columns  of  IlLx  Jfout  are 
independent  vectors  that  are  uniformly  distributed  on  the  unit  sphere  of  a  (D  —  d)-dimensional 
space.  Lemma  B.5  yields 


S{La 


v/Tout  +  V~D~ 


V Pont  +  1  +  c\l  Yf 


except  with  probability  1.5 e  c~d/2.  Under  the  assumption  (B.la),  we  conclude  that 


V(L)  >  V2d  ■  SiL-1)  ■  S(Rd) 

except  with  probability  3.5  e~c~d/2.  Apply  Theorem  3.1  to  complete  the  argument. 

To  establish  Claim  (ii),  we  also  need  to  bound  the  spherical  permeance  statistic  (3.2)  and  the 
spherical  linear  structure  statistic  (3.4).  Assume  that  d  >  2.  The  Nin  spherized  inliers  are  uniformly 
distributed  over  the  unit-sphere  in  the  d-dimensional  space  L.  Lemma  B.3,  with  t  =  cy/d,  implies 
that  the  spherical  permeance  statistic  satisfies 


V{L)> 


Nin 

\fd 


2'/w°  -  cJWi a  ^ 


(2  +  C\/2)  sj /9in 


except  with  probability  e  c~d/2.  The  second  inequality  follows  from  the  numerical  bound  d/ (d—  1)  < 
2  for  d  >  2.  For  the  outliers,  note  that  the  IVout  columns  of  Xout  are  independent  random  vectors, 
uniformly  distributed  on  the  unit  sphere  in  RD ,  so  Lemma  B.5  shows  that 


S(Rd)  =  UXoutll  < 


V  Tout  +  +  cyfd 


/  D 

fd 1 

J  D-  0.5 

y/ Pout  +  1  +  C\J 

VD  -  0.5 

except  with  probability  1.5e~c~d^2 .  Under  the  assumption  (B.lb),  we  conclude  that 


V(L)  >  y/2d-S(LL)  -5(Md), 

except  with  probability  4e^c~rf/2.  Theorem  3.1  establishes  the  result. 

Finally,  Claim  (iii)  follows  from  a  refined  bound  on  V(L)  that  holds  when  d  =  1.  In  this  case, 
Lemma  B.3  yields  V(L)  =  N[n.  The  result  follows  in  the  same  manner  as  part  (ii). 
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Appendix  C.  Analysis  of  the  IRLS  Algorithm 

This  appendix  contains  the  details  of  our  analysis  of  the  IRLS  method,  Algorithm  4.2.  First, 
we  verify  that  Algorithm  4.1  reliably  solves  the  weighted  least-squares  subproblem  (4.1).  Then,  we 
argue  that  IRLS  converges  to  a  point  near  the  true  optimum  of  the  REAPER  problem  (1.4). 

C.l.  Solving  the  Weighted  Least-Squares  Problem.  In  this  section,  we  verify  that  Algo¬ 
rithm  4.1  correctly  solves  the  weighted  least-squares  problem  (4.1).  The  following  lemma  provides 
a  more  mathematical  statement  of  the  algorithm,  along  with  the  proof  of  correctness. 

Lemma  C.l  (Solving  the  Weighted  Least-Squares  Problem).  Assume  that  0  <  d  <  D,  and  suppose 
that  X  i  a  set  of  observations  in  It0.  For  each  x  e  X,  let  j3x  be  a  nonnegative  weight.  Form  the 
weighted  sample  covariance  matrix  C ,  and  compute  its  eigenvalue  decomposition: 

C  :=  ^2  /3X  xxt  =  U AU1  where  Ai  >  •  •  •  >  A#  >  0. 

x&X 

When  rank(C)  <  d,  construct  a  vector  u  £  via  the  formula 

!/:=(l,  ...,  1,  d-  [d\,  0,  ...,  0)1.  (C.l) 

's - v - 

\_d\  times 

When  rank(C)  >  d,  define  the  positive  quantity  9  implicitly  by  solving  the  equation 

£  =  d.  (c.2) 

i=i  Ai 

Construct  a  vector  v>  £  MD  whose  components  are 

Vi  :=  —  t  ^ +  fori  =  (C.3) 

^ i 

In  either  case,  an  optimal  solution  to  (4.1)  is  given  by 

P*  :=U  ■  diag(iz)  •  Ul.  (C.4) 

In  this  statement,  (a)+  :=  max{0,  a},  we  enforce  the  convention  0/0  :=  0,  and  diag  forms  a 
diagonal  matrix  from  a  vector. 

Proof  of  Lemma  C.l.  First,  observe  that  the  construction  (C.4)  yields  a  matrix  P*  that  satisfies 
the  constraints  of  (4.1)  in  both  cases. 

When  rank(C)  <  d ,  we  can  verify  that  our  construction  of  the  vector  v  yields  a  optimizer  of  (4.1) 
by  showing  that  the  objective  value  is  zero,  which  is  minimal.  Evaluate  the  objective  function  (4.1) 
at  the  point  P*  to  see  that 

£  ||(I-P*)®||2=tr[(I-P*)C(I-P*)]  =  ^(l-^)2Ai  (C.5) 

x£X  i=l 

by  definition  of  C  and  the  fact  that  C  and  P*  are  simultaneously  diagonalizable.  The  nonzero 
eigenvalues  of  C  appear  among  Ai, . . . ,  Aidj .  At  the  same  time,  1  —  ty;  =  0  for  each  i  =  1, ...,  [d\. 
Therefore,  the  value  of  (C.5)  equals  zero  at  P*. 

Next,  assume  that  rank(C)  >  d.  The  objective  function  in  (4.1)  is  convex,  so  we  can  verify  that 
P*  solves  the  optimization  problem  if  the  directional  derivative  of  the  objective  at  P*  is  nonnegative 
in  every  feasible  direction.  A  matrix  A  is  a  feasible  perturbation  if  and  only  if 

0  =<!  P*  +  A  ^  I  and  tr  A  =  0. 

Let  A  be  an  arbitrary  matrix  that  satisfies  these  constraints.  By  expanding  the  objective  of  (4.1) 
about  P*,  easily  compute  the  derivative  in  the  direction  A.  Therefore,  the  condition 

-  (A,  (I  -  P*)C>  >  0 


(C.6) 
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ensures  that  the  derivative  increases  in  the  direction  A. 

First,  note  that  the  quantity  9  can  be  defined.  Indeed,  the  left-hand  side  of  (C.2)  equals  rank(C) 
when  0  =  0,  and  it  equals  zero  when  9  >  X\.  By  continuity,  there  exists  a  value  of  6  that  solves 
the  equation.  Let  i*  be  the  largest  index  where  A ^  >  9,  so  that  i/j  =  0  for  each  i  >  i*.  Next, 
define  M  to  be  the  subspace  spanned  by  the  eigenvectors  . . . ,  up.  Since  za  is  the  eigenvalue 

of  P*  with  eigenvector  w,,  we  must  have  II mP*FIm  =  0.  It  follows  that  UmATIm  '%=  0  because 

n m(p*  +  a)iim  >  o. 

To  complete  the  argument,  observe  that 

(1  -  Vi)\  =  A i  -  (A i  -  9)+  =  min{Aj,  9). 

Therefore,  (I  —  P*)C  =  U  ■  diag(min{Aj,  9})  ■  Ut.  Using  the  fact  that  tr  A  =  0,  we  obtain 
(A,  (I  -  P*)C>  =  (A,  U  •  diag(min{Aj,  9}  -  9)  •  Ul) 

=  (A,  U  •  diag(0, . . . ,  0,  An+i  -  9, . . . ,  AD  -  9)  ■  Ul) 

=  :Z 

Since  A,;  <  9  for  each  i  >  **,  each  eigenvalue  of  Z  is  nonpositive.  Furthermore,  HmZHm  =  Z.  We 
see  that 

(A,  (i  -  p*)C)  =  (A,  nMznM)  =  (nMAuM,  z)  <  o, 

because  the  compression  of  A  on  M  is  positive  semidefinite  and  Z  is  negative  semidefinite.  In 
other  words,  (C.6)  is  satisfied  for  every  feasible  perturbation  A  about  P*.  □ 

C.2.  Convergence  of  IRLS.  In  this  section,  we  argue  that  the  IRLS  method  of  Algorithm  4.2 
converges  to  a  point  whose  value  is  nearly  optimal  for  the  reaper  problem  (1.4).  The  proof  consists 
of  two  phases.  First,  we  explain  how  to  modify  the  argument  from  [CM99]  to  show  that  the  iterates 
p(fc)  converge  to  a  matrix  P$,  which  is  characterized  as  the  solution  to  a  regularized  counterpart 
of  reaper.  The  fact  that  the  limit  point  P5  achieves  a  near-optimal  value  for  reaper  follows  from 
the  characterization. 


Proof  sketch  for  Theorem  f.l.  We  find  it  more  convenient  to  work  with  the  variables  Q  :=  I  —  P 
and  Q(l:)  :=  I  —  P^k\  First,  let  us  define  a  regularized  objective.  For  a  parameter  5  >  0,  consider 
the  Huber-like  function 


Hs(x,y) 


Ht  +  s).  o<v<s 

+  v>s- 


We  introduce  the  convex  function 


F(Q):=  '£m\Q*\\>.  ]IQ*II) 

x&X 

^-/{a::||Qa;||>S}  ^  5Z{a;:|jQa;||<5} 


The  second  identity  above  highlights  the  interpretation  of  F  as  a  regularized  objective  function 
for  (1.4)  under  the  assignment  Q  =  I  —  P.  Note  that  F  is  continuously  differentiable  at  each 
matrix  Q,  and  the  gradient 

11  maxOQx||  ,  i} ' 

The  technical  assumption  that  the  observations  do  not  lie  in  the  union  of  two  strict  subspaces  of 
implies  that  F  is  strictly  convex;  compare  with  the  proof  [ZL11,  Thm.  1],  We  define  Q$  to  be 
the  solution  of  a  constrained  optimization  problem: 


Qs  :=  arg  min  F(Q). 

tr  Q=D—d 
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The  strict  convexity  of  F  implies  that  Q$  is  well  defined. 

The  key  idea  in  the  proof  is  to  show  that  the  iterates  Q1'^  of  Algorithm  4.2  converge  to  the 
optimizer  Q$  of  the  regularized  objective  function  F.  We  demonstrate  that  Algorithm  4.2  is  a  gen¬ 
eralized  Weiszfeld  method  in  the  sense  of  [CM99,  Sec.  4],  After  defining  some  additional  auxiliary 
functions  and  facts  about  these  functions,  we  explain  how  the  argument  of  [CM99,  Lem.  5.1]  can 
be  adapted  to  prove  that  the  iterates  of  Q ^  =  I  —  P — >  Qs.  The  only  innovation  required  is  an 
inequality  from  convex  analysis  that  lets  us  handle  the  constraints  0  ^  Q  =<!  I  and  tr  Q  =  D  —  d. 
Now  for  the  definitions.  We  introduce  the  potential  function 

G(Q,  Q{k))-.=  J2Hs(\\Qx\\,  \\Q^x\\). 

Then  G{-,  Q^)  is  a  smooth  quadratic  function.  By  collecting  terms,  we  may  relate  G  and  F 
through  the  expansion 

G(Q,  Q{k) )  =  F(QW)  +  (Q-  Q(k\  VF(Q(fe)))  +  \(Q-  Q[k\  C{Q{k)){Q  -  Q(fc)))  , 

where  C  is  the  continuous  function 

C(qW)  ■=  V  _ — _ 

Next,  we  verify  some  facts  related  to  Hypothesis  4.2  and  4.3  of  [CM99,  Sec.  4],  Note  that 
F(Q )  =  G(Q ,  Q).  Furthermore,  F(Q)  <  G(Q,  Q because  Hg(x,x)  <  Hs(x,y ),  which  is  a 
direct  consequence  of  the  AM-GM  inequality. 

We  now  relate  the  iterates  of  Algorithm  4.2  to  the  definitions  above.  Given  that  Q ^  =  I  —  _p(fe), 
Step  2b  of  Algorithm  4.2  is  equivalent  to  the  iteration 

Q(k+1)  =  argmin  Q(Q  qW). 

tr  Q—D—d 

From  this  characterization,  we  have  the  monotonicity  property 

F(Q(fc+1))  <  G(Q(fc+1),  Q(fc))  <  G{Q(k\  Q[k))  =  F(Q <*>).  (C.7) 

This  fact  motivates  the  stopping  criterion  for  Algorithm  4.2  because  it  implies  the  objective  values 
are  decreasing:  a^k+1^  =  F(Qk+1)  <  F(Q ^  =  a^k\ 

We  also  require  some  information  regarding  the  bilinear  form  induced  by  C .  Introduce  the 
quantity  m  :=  max  {<5,  max£ce^’{||*||}}.  Then,  by  symmetry  of  the  matrix  Q,  and  the  fact  that 
the  inner  product  between  positive  semidefinite  matrices  is  nonnegative  we  have 

(Q,  C(QW)Q)  >  ^tr  (q2  £  xx^j  >  ||Q||2  (Amin 

=:/* 

The  technical  assumption  that  the  observations  do  not  lie  in  two  strict  subspaces  of  implies  in 
particular  that  the  observations  span  MD.  We  deduce  that  n  >  0. 

Now  we  discuss  the  challenge  imposed  by  the  constraint  set.  When  the  minimizer  Q lies  on 
the  boundary  of  the  constraint  set,  the  equality  [CM99,  Eqn.  (4.3)]  may  not  hold.  However,  if  we 
denote  the  gradient  of  G  with  respect  to  its  first  argument  by  Gq ,  the  inequality 

o  (Q  Q  \  GQ{Q[k+l\  Q{k) )> 

=  (Q-  Q(fc+1),  VF(gW)  +  C(Q(kd{Q(k+1)  -  Q(k)))  (C.8) 

holds  for  every  Q  in  the  feasible  set.  This  is  simply  the  first-order  necessary  and  sufficient  condition 
for  the  constrained  minimum  of  a  smooth  convex  function  over  a  convex  set. 
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With  the  facts  above,  a  proof  that  the  iterates  Q ^  converge  to  Qs  follows  the  argument 
of  [CM99,  Lem.  5.1]  nearly  line-by-line.  However,  due  to  inequality  (C.8),  the  final  conclusion 
is  that,  at  the  limit  point  Q,  the  inequality  (Q  —  Q,  VF(Q))  >  0  holds  for  all  feasible  Q.  This 
inequality  characterizes  the  global  minimum  of  a  convex  function  over  a  convex  set,  so  the  limit 
point  must  indeed  be  a  global  minimizer.  That  is,  Q  =  Qs-  In  particular,  this  argument  shows 
that  the  iterates  P ^  converge  to  Ps  :=  I  —  Qs  as  k  — »  oo. 

The  only  remaining  claim  is  that  Ps  =  I  Qs  nearly  minimizes  (1.4).  We  abbreviate  the  objective 
of  (1.4)  under  the  identification  Q  =  I  —  P  by 

110*11  ■ 

Define  Q*  :=  arg  minFo(Q)  with  respect  to  the  feasible  set  0  ^  Q  ^  I  and  tr(Q)  =  D  —  d.  From 
the  easy  inequalities  x  <  Hs(x,  x)  <  x  +  ^5  for  x  >  0,  we  see  that 

o  <  f(Q)  -  Fq(Q)  <  hs\x\. 

Evaluate  the  latter  inequality  at  Qs,  and  subtract  the  result  from  the  inequality  evaluated  at  Q* 
to  reach 

(F(Q*)  -  F(Qs))  +  {Fo (Qs)  ~  F0(Q *))  <  hs  |Af|  . 

Since  Qs  and  Q*  are  optimal  for  their  respective  problems,  both  terms  in  parenthesis  above  are 
positive,  and  we  deduce  that  Fo(Qs)  ~  Fo(Q^)  <  ^5  \X\.  Since  To  is  the  objective  function  or  (1.4) 
under  the  map  P  =  I  —  Q,  the  proof  is  complete.  □ 

Appendix  D.  Proof  of  Theorem  6.2 

We  need  to  verify  that  the  weak  reaper  problem  (6.1)  has  a  solution  that  satisfies  the  constraints 
of  the  reaper  problem.  The  argument  is  constructive.  Given  a  feasible  point  P  for  (6.1),  we  convert 
it  into  a  matrix  P'  that  has  a  smaller  objective  value  and  satisfies  the  constraint  P'  =<!  I.  Since 
the  weak  reaper  problem  is  a  relaxation  of  the  reaper  problem,  we  can  convert  a  solution  of  the 
weak  reaper  problem  into  a  solution  to  the  REAPERproblem. 

Let  P  be  feasible  for  (6.1),  and  suppose  that  P  =  U AUl  is  an  eigenvalue  decomposition  with 
eigenvalues  arranged  in  weakly  decreasing  order:  Ai  >  •  •  •  >  Ad  >  0.  We  construct  a  new  diagonal 
matrix  A'  whose  diagonal  elements  satisfy 

(i)  0  <  A'  <  1  for  each  i  =  1, . . . ,  D. 

(ii)  E?=iA'=trP. 

(hi)  (1  -  A')2  <  (1  -  A*)2 

Then  we  set  P'  =  UA'U1.  Let  us  demonstrate  that  this  matrix  has  the  required  properties. 

Conditions  (i)  and  (ii)  imply  that  P '  is  feasible  for  reaper.  We  only  need  to  check  that  the 
objective  value  of  (weak)  reaper  is  the  same  at  P  and  P' .  By  unitary  invariance,  we  compute 
that,  for  each  vector  x, 

||(I  -  =  ||(I  -  A 'X^aOH2  =  £(1-  -  A')2(C/t*)2  <  £(1  -  Aj)2(C/tcc)2  =  || (I  -  P)x ||2. 

i=  1  i=  1 

Summing  over  x  6  X ,  we  see  that  the  objective  value  of  (weak)  reaper  at  P'  does  not  exceed  the 
value  at  P. 

We  construct  the  desired  matrix  A'  by  a  water-filling  argument.  Let  z*  be  the  smallest  integer 
such  that  E}=i  Aj  <  z*.  This  number  is  well  defined  because  l  Aj  =  d  and  0  <  d  <  D. 
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Furthermore,  since  the  eigenvalues  are  listed  in  decreasing  order, 

A  i  <  —  ^2  A  j  <1  for  i  >  i-y. 

l*j= i 


Set 


Ai  =  ^  1  i*  +  J2j*=i  Aj, 


z  <  z* 


z  =  z* 


Aj,  i  >  z*. 

We  just  need  to  verify  that  A'  satisfy  Conditions  (i)-(iii). 
Condition  (i)  is  evident  for  i  /  z*.  When  i  =  z*, 

A't  =  1  -  z*  +  ^  Aj  <  1 

i=i 


(D.l) 


(D.2) 


because  the  sum  is  at  most  z*.  Furthermore,  since  z*  is  the  smallest  integer  with  the  specified 
property,  we  have 

«*- 1 

X*  =  'W  +  X!  _  (**  _  !)  >  >  0-  (D-3) 

i=i 

The  last  inequality  holds  because  A*  >  0  for  each  i. 

For  Condition  (ii),  we  split  the  sum  of  A'  into  three  pieces  according  to  the  cases  in  (D.l),  and 
we  compute 


D 


EX 


i* 

D 

1  ~  **  +  Yl  Aj 

+ 

E  x 

i=1 

_j=i*+ 1 

D 


=  Ex 


tr  P . 


Condition  (iii)  clearly  holds  for  each  z  /  z*.  For  z  =  z*,  observe  that  0  <  A^  <  A^  <  1  because 
of  (D.2)  and  (D.3).  The  claim  follows. 
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